Natural Language Processing
Research on language models, text understanding, generation, and computational linguistics.
cs.CL · 805 papers

RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
RubricEM is a meta-RL framework that uses rubrics to guide policy decomposition and reflection for training research agents without verifiable rewards.
Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking
BICR improves LVLM confidence estimation by contrasting real and blind image inputs, detecting visual ungroundedness with high accuracy and efficiency.
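The blind-contrast idea can be sketched as follows; the function and numbers here are purely illustrative stand-ins, not BICR's actual scoring procedure. Score the model's answer once with the real image and once with a "blind" (blank) input, and treat the gap as a groundedness signal:

```python
# Hedged sketch of blind-image contrastive confidence (hypothetical API;
# the paper's actual ranking and scoring may differ).

def contrastive_confidence(logprob_with_image: float,
                           logprob_blind: float) -> float:
    """Confidence = how much the real image raises the answer's log-prob.

    A small or negative gap suggests the answer is not visually grounded:
    the model would have said the same thing without seeing the image.
    """
    return logprob_with_image - logprob_blind

# Toy example: two answers from a vision-language model (made-up scores).
grounded = contrastive_confidence(logprob_with_image=-1.2, logprob_blind=-5.0)
guessed = contrastive_confidence(logprob_with_image=-1.3, logprob_blind=-1.4)

assert grounded > guessed  # the grounded answer relies on the image far more
```

The appeal of this style of check is that it needs no extra training: it reuses the model's own likelihoods under two inputs.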
Neural at ArchEHR-QA 2026: One Method Fits All: Unified Prompt Optimization for Clinical QA over EHRs
Neural1.5 uses modular prompt optimization and self-consistency to achieve strong results in clinical QA over EHRs, ranking second overall.
Compute Where it Counts: Self Optimizing Language Models
Self-Optimizing Language Models (SOL) dynamically allocate computation per token, improving LLM inference efficiency and quality over static methods.
DGPO: Beyond Pairwise Preferences with Directional Consistent Groupwise Optimization
DGPO is a new preference optimization method for LLMs that improves directional consistency and reasoning diversity using group-wise, multi-candidate comparisons.
RUBEN: Rule-Based Explanations for Retrieval-Augmented LLM Systems
RUBEN is an interactive tool that uses minimal rule sets and novel pruning to explain retrieval-augmented LLM outputs and test their safety and resilience.
Learning More from Less: Exploiting Counterfactuals for Data-Efficient Chart Understanding
ChartCF improves VLM chart understanding data-efficiently by leveraging counterfactuals through novel data synthesis, selection, and multimodal optimization.
Grounded Satirical Generation with RAG
This paper introduces a RAG-based pipeline for grounded satire generation, finding it improves political relevance but not humor.
The Generalized Turing Test: A Foundation for Comparing Intelligence
The Generalized Turing Test (GTT) offers a formal, dataset-agnostic framework to compare AI agent intelligence via indistinguishability.
Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?
Pi-Serini demonstrates that well-tuned lexical retrieval with capable LLMs can effectively support deep agentic search, outperforming dense retrievers.
BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation
BabelDOC is an IR-based framework that accurately translates PDFs while preserving their original visual layout and improving terminology consistency.
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
DISCA is a training-free, inference-time method that culturally aligns LLMs by leveraging within-country sociodemographic disagreement, improving fairness.
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
ODE enhances multimodal deep search agents via an image bank for reusable visual evidence and on-policy data evolution, improving performance significantly.
SLIM: Sparse Latent Steering for Interpretable and Property-Directed LLM-Based Molecular Editing
SLIM enhances LLM molecular editing by using sparse latent steering to precisely control properties and improve success rates.
Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
RACER dynamically routes between reasoning and non-reasoning LLM judges to optimize accuracy and cost, especially under distribution shift.
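A minimal sketch of this kind of cost-aware routing, assuming a confidence-gated policy; RACER's actual policy is learned, and the threshold and judge stubs below are hypothetical:

```python
# Hedged sketch of cost-aware LLM-judge routing (illustrative only; the
# fixed threshold and judge stubs are assumptions, not RACER's method).

def expensive_reasoning_judge(item: str) -> float:
    """Stand-in for a slow, costly chain-of-thought judge call."""
    return 0.9  # placeholder verdict score

def route(item: str, cheap_score: float, cheap_confidence: float,
          threshold: float = 0.8) -> tuple[float, str]:
    """Escalate to the reasoning judge only when the cheap judge is unsure,
    trading extra cost for accuracy on hard or distribution-shifted inputs."""
    if cheap_confidence >= threshold:
        return cheap_score, "non-reasoning"
    return expensive_reasoning_judge(item), "reasoning"

easy = route("short factual answer", cheap_score=0.7, cheap_confidence=0.95)
hard = route("ambiguous long answer", cheap_score=0.5, cheap_confidence=0.4)
```

Under this gate, most items stay on the cheap judge and only uncertain cases pay for reasoning.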
The Last Word Often Wins: A Format Confound in Chain-of-Thought Corruption Studies
Chain-of-thought corruption studies are confounded by explicit answer formats; models often follow the final answer text, not the reasoning.
LITMUS: Benchmarking Behavioral Jailbreaks of LLM Agents in Real OS Environments
LITMUS benchmarks LLM agent behavioral jailbreaks in real OS environments, revealing critical safety gaps and a new "Execution Hallucination" phenomenon.
Step Rejection Fine-Tuning: A Practical Distillation Recipe
Step Rejection Fine-Tuning (SRFT) improves LLM agent training by leveraging partially correct, unresolved trajectories, outperforming standard RFT.
When Can Digital Personas Reliably Approximate Human Survey Findings?
This paper evaluates when LLM-powered digital personas can reliably substitute human survey respondents, finding they align distributionally but struggle with individual predictions.
LLARS: Enabling Domain Expert & Developer Collaboration for LLM Prompting, Generation and Evaluation
LLARS is an open-source platform enabling domain experts and developers to collaboratively engineer, generate, and evaluate LLM outputs efficiently.