ArXiv TLDR

Natural Language Processing

Research on language models, text understanding, generation, and computational linguistics.

cs.CL · 805 papers

WARDEN: Endangered Indigenous Language Transcription and Translation with 6 Hours of Training Data

WARDEN is a novel two-stage system that transcribes and translates the endangered Wardaman language into English using only 6 hours of audio, outperforming larger models.

2605.13846 · May 13, 2026 · Ziheng Zhang, Yunzhong Hou, Naijing Liu +1

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

EVA-Bench is a new end-to-end framework for evaluating voice agents using realistic bot-to-bot audio simulations and novel composite metrics.

2605.13841 · May 13, 2026 · Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz +10

Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights

TFlow enables multi-agent LLMs to communicate via transient weight perturbations, boosting efficiency and accuracy over text-based methods.

2605.13839 · May 13, 2026 · Wenrui Bao, Huan Wang, Jian Wang +3

Negation Neglect: When models fail to learn negations in training

LLMs fine-tuned on documents that flag claims as false often learn to believe those claims are true, a phenomenon called Negation Neglect.

2605.13829 · May 13, 2026 · Harry Mayne, Lev McKinney, Jan Dubiński +3

An LLM-Based System for Argument Reconstruction

This paper introduces an LLM-based system that reconstructs natural language arguments into abstract argument graphs, showing potential for scalable analysis.

2605.13793 · May 13, 2026 · Paulo Pirozelli, Victor Hugo Nascimento Rocha, Fabio G. Cozman +1

Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry

This paper introduces a novel method for detecting step-level hallucinations in LLMs by analyzing hidden-state transport geometry during a single forward pass.

2605.13772 · May 13, 2026 · Tyler Alvarez, Ali Baheri

Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching

This paper compares dense and MoE transformers at tiny scale, finding that MoE models outperform dense models when matched on active parameters but not when matched on total parameters.

2605.13769 · May 13, 2026 · Abdalrahman Wael

Senses Wide Shut: A Representation-Action Gap in Omnimodal LLMs

Omnimodal LLMs struggle to reject false textual claims contradicting sensory input, revealing a "Representation-Action Gap" in grounding.

2605.13737 · May 13, 2026 · Trung Nguyen Quang, Yiming Gao, Fanyi Pu +3

Children's English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety

Fine-tuning compact 8B LLMs with expert curricula generates children's English stories with controllable difficulty and safety, outperforming larger models.

2605.13709 · May 13, 2026 · Qian Shen, Fanghua Cao, Min Yao +3

RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning

RTLC, a three-stage prompting paradigm inspired by the Feynman Learning Technique, significantly boosts LLM-as-judge accuracy on JudgeBench without fine-tuning.

2605.13695 · May 13, 2026 · Andrea Morandi

Fine-tuning with Hierarchical Prompting for Robust Propaganda Classification Across Annotation Schemas

A new intent-focused propaganda taxonomy and hierarchical prompting (HiPP) significantly improve robust propaganda classification, especially after fine-tuning.

2605.13663 · May 13, 2026 · Lukas Stähelin, Veronika Solopova, Max Upravitelev +6

Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training

Low-rank pre-training methods yield geometrically distinct solutions from full-rank models and each other, even with similar perplexity, requiring deeper evaluation metrics.

2605.13652 · May 13, 2026 · Namrata Shivagunde, Vijeta Deshpande, Sherin Muckatira +1

FlowCompile: An Optimizing Compiler for Structured LLM Workflows

FlowCompile is an optimizing compiler for structured LLM workflows that explores design space at compile-time to find efficient, reusable configurations.

2605.13647 · May 13, 2026 · Junyan Li, Zhang-Wei Hong, Maohao Shen +2

Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation

A new on-policy distillation method, "Prefix Teach, Suffix Fade," improves strong-to-weak model training by focusing supervision on locally teachable trajectory segments.

2605.13643 · May 13, 2026 · Kaiyuan Liu, Ziyuan Zhuang, Yang Bai +3

Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization

RDPO improves multi-objective and mixed-reward RL by decorrelating rewards and stabilizing advantage allocation for diverse reward types.

2605.13641 · May 13, 2026 · Yang Bai, Kaiyuan Liu, Ziyuan Zhuang +5

Edit-level Majority Voting Mitigates Over-Correction in LLM-based Grammatical Error Correction

This paper introduces edit-level majority voting to reduce over-correction in LLM-based grammatical error correction, improving performance.

2605.13624 · May 13, 2026 · Takumi Goto, Yusuke Sakai, Taro Watanabe

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

Automatic metrics and LLM judges poorly evaluate creativity in literary translations, often penalizing creative solutions and showing bias towards machine output.

2605.13596 · May 13, 2026 · Kyo Gerrits, Rik van Noord, Ana Guerberof Arenas

Inducing Artificial Uncertainty in Language Models

A new method induces artificial uncertainty in language models on easy data, improving their calibration and uncertainty quantification on challenging tasks.

2605.13595 · May 13, 2026 · Sophia Hager, Simon Zeng, Nicholas Andrews

RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

RealICU is a new benchmark for evaluating LLM agents on long-context ICU data, revealing recall-safety tradeoffs and anchoring biases in existing models.

2605.13542 · May 13, 2026 · Chengzhi Shen, Weixiang Shen, Tobias Susetzky +8

Locale-Conditioned Few-Shot Prompting Mitigates Demonstration Regurgitation in On-Device PII Substitution with Small Language Models

An on-device PII substitution pipeline uses locale-conditioned few-shot prompting to keep small language models (SLMs) from regurgitating demonstrations, though rule-based methods aid downstream NER more.

2605.13538 · May 13, 2026 · Anuj Sadani, Deepak Kumar
