Natural Language Processing
Research on language models, text understanding, generation, and computational linguistics.
cs.CL · 805 papers

RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
RubricEM is a meta-RL framework that uses rubrics to guide policy decomposition and reflection for training research agents without verifiable rewards.
Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking
BICR improves LVLM confidence estimation by contrasting real and blind image inputs, detecting visual ungroundedness with high accuracy and efficiency.
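The blind-contrast idea can be sketched as follows; the function and numbers here are purely illustrative stand-ins, not BICR's actual scoring procedure. Score the model's answer once with the real image and once with a "blind" (blank) input, and treat the gap as a groundedness signal:

```python
# Hedged sketch of blind-image contrastive confidence (hypothetical API;
# the paper's actual ranking and scoring may differ).

def contrastive_confidence(logprob_with_image: float,
                           logprob_blind: float) -> float:
    """Confidence = how much the real image raises the answer's log-prob.

    A small or negative gap suggests the answer is not visually grounded:
    the model would have said the same thing without seeing the image.
    """
    return logprob_with_image - logprob_blind

# Toy example: two answers from a vision-language model (made-up scores).
grounded = contrastive_confidence(logprob_with_image=-1.2, logprob_blind=-5.0)
guessed = contrastive_confidence(logprob_with_image=-1.3, logprob_blind=-1.4)

assert grounded > guessed  # the grounded answer relies on the image far more
```

The appeal of this style of check is that it needs no extra training: it reuses the model's own likelihoods under two inputs.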
Neural at ArchEHR-QA 2026: One Method Fits All: Unified Prompt Optimization for Clinical QA over EHRs
Neural1.5 uses modular prompt optimization and self-consistency to achieve strong results in clinical QA over EHRs, ranking second overall.
Compute Where it Counts: Self Optimizing Language Models
Self-Optimizing Language Models (SOL) dynamically allocate computation per token, improving LLM inference efficiency and quality over static methods.
DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization
DGPO is a new preference optimization method for LLMs that improves directional consistency and reasoning diversity using group-wise, multi-candidate comparisons.
RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems
RUBEN is an interactive tool that uses minimal rule sets and novel pruning to explain retrieval-augmented LLM outputs and test their safety and resilience.
Learning More from Less: Exploiting Counterfactuals for Data-Efficient Chart Understanding
ChartCF improves VLM chart understanding data-efficiently by leveraging counterfactuals through novel data synthesis, selection, and multimodal optimization.
Grounded Satirical Generation with RAG
This paper introduces a RAG-based pipeline for grounded satire generation, finding it improves political relevance but not humor.
The Generalized Turing Test: A Foundation for Comparing Intelligence
The Generalized Turing Test (GTT) offers a formal, dataset-agnostic framework to compare AI agent intelligence via indistinguishability.
Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?
Pi-Serini demonstrates that well-tuned lexical retrieval with capable LLMs can effectively support deep agentic search, outperforming dense retrievers.
BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation
BabelDOC is an IR-based framework that accurately translates PDFs while preserving their original visual layout and improving terminology consistency.
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
DISCA is a training-free, inference-time method that culturally aligns LLMs by leveraging within-country sociodemographic disagreement, improving fairness.
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
ODE enhances multimodal deep search agents via an image bank for reusable visual evidence and on-policy data evolution, improving performance significantly.
SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
SLIM enhances LLM molecular editing by using sparse latent steering to precisely control properties and improve success rates.
Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
RACER dynamically routes between reasoning and non-reasoning LLM judges to optimize accuracy and cost, especially under distribution shift.
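A minimal sketch of this kind of cost-aware routing, assuming a confidence-gated policy; RACER's actual policy is learned, and the threshold and judge stubs below are hypothetical:

```python
# Hedged sketch of cost-aware LLM-judge routing (illustrative only; the
# fixed threshold and judge stubs are assumptions, not RACER's method).

def expensive_reasoning_judge(item: str) -> float:
    """Stand-in for a slow, costly chain-of-thought judge call."""
    return 0.9  # placeholder verdict score

def route(item: str, cheap_score: float, cheap_confidence: float,
          threshold: float = 0.8) -> tuple[float, str]:
    """Escalate to the reasoning judge only when the cheap judge is unsure,
    trading extra cost for accuracy on hard or distribution-shifted inputs."""
    if cheap_confidence >= threshold:
        return cheap_score, "non-reasoning"
    return expensive_reasoning_judge(item), "reasoning"

easy = route("short factual answer", cheap_score=0.7, cheap_confidence=0.95)
hard = route("ambiguous long answer", cheap_score=0.5, cheap_confidence=0.4)
```

Under this gate, most items stay on the cheap judge and only uncertain cases pay for reasoning.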
The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies
Chain-of-thought corruption studies are confounded by explicit answer formats; models often follow the final answer text, not the reasoning.
LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments
LITMUS benchmarks LLM agent behavioral jailbreaks in real OS environments, revealing critical safety gaps and a new "Execution Hallucination" phenomenon.
Step Rejection Fine-Tuning: A Practical Distillation Recipe
Step Rejection Fine-Tuning (SRFT) improves LLM agent training by leveraging partially correct, unresolved trajectories, outperforming standard RFT.
When Can Digital Personas Reliably Approximate Human Survey Findings?
This paper evaluates when LLM-powered digital personas can reliably substitute human survey respondents, finding they align distributionally but struggle with individual predictions.
LLARS: Enabling Domain Expert & Developer Collaboration for LLM Prompting, Generation and Evaluation
LLARS is an open-source platform enabling domain experts and developers to collaboratively engineer, generate, and evaluate LLM outputs efficiently.