ArXiv TLDR

Natural Language Processing

Research on language models, text understanding, generation, and computational linguistics.

cs.CL · 805 papers

A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles

This paper introduces a standardized 'level-playing-field' evaluation for controlled text generation systems, revealing that many systems underperform their original claims.

2605.12395 · May 12, 2026 · Michela Lorandi, Anya Belz

Scalable Token-Level Hallucination Detection in Large Language Models

TokenHD is a scalable pipeline for training token-level hallucination detectors in LLMs, outperforming larger models in detecting reasoning errors.

2605.12384 · May 12, 2026 · Rui Min, Tianyu Pang, Chao Du +2

Pretraining Exposure Explains Popularity Judgments in Large Language Models

LLMs' popularity judgments are primarily driven by pretraining data exposure, not external popularity, as shown by analyzing OLMo and Dolma.

2605.12382 · May 12, 2026 · Jamshid Mozafari, Bhawna Piryani, Adam Jatowt

Context Convergence Improves Answering Inferential Questions

This paper shows that constructing passages with high "context convergence" significantly improves LLM accuracy on inferential question answering tasks.

2605.12370 · May 12, 2026 · Jamshid Mozafari, Bhawna Piryani, Adam Jatowt

MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering

MedHopQA is a new disease-centered multi-hop reasoning benchmark for evaluating LLMs in biomedical QA, designed to resist saturation and contamination.

2605.12361 · May 12, 2026 · Rezarta Islamaj, Robert Leaman, Joey Chan +13

Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering

The MedHopQA track benchmarked LLMs on multi-hop medical QA with a new 1,000-pair dataset, highlighting RAG's importance for strong performance.

2605.12313 · May 12, 2026 · Rezarta Islamaj, Joey Chan, Robert Leaman +13

Reconstruction of Personally Identifiable Information from Supervised Finetuned Models

This paper shows that PII can be reconstructed from supervised finetuned LLMs and proposes COVA to enhance reconstruction under prefix attacks.

2605.12264 · May 12, 2026 · Sae Furukawa, Alina Oprea

Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models

This paper introduces Uni-AdGen, a unified autoregressive model for personalized image and text ad generation, improving realism and user preference.

2605.12138 · May 12, 2026 · Yexing Xu, Wei Feng, Shen Zhang +15

World Action Models: The Next Frontier in Embodied AI

This survey gives a systematic overview of World Action Models (WAMs), a new embodied AI paradigm that unifies predictive state modeling with action generation.

2605.12090 · May 12, 2026 · Siyin Wang, Junhao Shi, Zhaoyang Fu +11

Caraman at SemEval-2026 Task 8: Three-Stage Multi-Turn Retrieval with Query Rewriting, Hybrid Search, and Cross-Encoder Reranking

This paper presents a three-stage multi-turn retrieval system using query rewriting, hybrid search, and cross-encoder reranking for SemEval-2026 Task 8.

2605.12028 · May 12, 2026 · David-Maximilian Caraman, Gheorghe Cosmin Silaghi

SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces

SkillSafetyBench evaluates how reusable skills in LLM agents create new attack surfaces, revealing vulnerabilities beyond model-level alignment.

2605.12015 · May 12, 2026 · Chang Jin, An Wang, Zeming Wei +7

StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning

StepCodeReasoner uses RL to align code reasoning with stepwise execution traces, achieving SOTA performance by supervising intermediate states.

2605.11922 · May 12, 2026 · Hao Wang, Rui Li, Lei Sha +1

AgentDisCo: Towards Disentanglement and Collaboration in Open-ended Deep Research Agents

AgentDisCo is a novel agentic architecture that disentangles information exploration and exploitation for deep research, achieving self-refinement and strong performance.

2605.11732 · May 12, 2026 · Jiarui Jin, Zexuan Yan, Shijian Wang +2

Much of Geospatial Web Search Is Beyond Traditional GIS

This paper reveals that geospatial web search is far more prevalent and practically oriented than previously understood, often exceeding traditional GIS capabilities.

2605.11336 · May 11, 2026 · Ilya Ilyankou, Stefano Cavazzi, James Haworth

Unlocking LLM Creativity in Science through Analogical Reasoning

Analogical Reasoning (AR) enables LLMs to generate significantly more diverse and novel solutions for scientific problems, mitigating mode collapse.

2605.11258 · May 11, 2026 · Andrew Shen, Shaul Druckmann, James Zou

Decomposing Evolutionary Mixture-of-LoRA Architectures: The Routing Lever, the Lifecycle Penalty, and a Substrate-Conditional Boundary

This paper decomposes an evolutionary Mixture-of-LoRA system, finding that router improvements, not the evolutionary lifecycle, drive performance gains.

2605.11153 · May 11, 2026 · Ramchand Kumaresan

ELF: Embedded Language Flows

ELF is a continuous diffusion model for language that applies flow matching in embedding space to achieve superior generation quality with fewer steps.

2605.10938 · May 11, 2026 · Keya Hu, Linlu Qiu, Yiyang Lu +5

DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

DECO is a sparse MoE model matching dense performance on end-side devices, offering 3x speedup and reduced storage overhead.

2605.10933 · May 11, 2026 · Chenyang Song, Weilin Zhao, Xu Han +3

Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning

SLIM dynamically manages external skills for LLM agents in RL, optimizing their active skill set for improved task performance.

2605.10923 · May 11, 2026 · Junhao Shen, Teng Zhang, Xiaoyan Zhao +1

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

WildClawBench is a new benchmark for evaluating long-horizon, real-world agents using native runtimes and real tools.

2605.10912 · May 11, 2026 · Shuangrui Ding, Xuanlang Dai, Long Xing +14
