Natural Language Processing
Research on language models, text understanding, generation, and computational linguistics.
cs.CL · 805 papers

A Comparative Study of Controlled Text Generation Systems Using Level-Playing-Field Evaluation Principles
This paper introduces a standardized "level-playing-field" evaluation for controlled text generation systems, revealing that many systems underperform their original claims.
Scalable Token-Level Hallucination Detection in Large Language Models
TokenHD is a scalable pipeline for training token-level hallucination detectors in LLMs, outperforming larger models in detecting reasoning errors.
Pretraining Exposure Explains Popularity Judgments in Large Language Models
LLMs' popularity judgments are primarily driven by pretraining data exposure, not external popularity, as shown by analyzing OLMo and Dolma.
Context Convergence Improves Answering Inferential Questions
This paper shows that constructing passages with high "context convergence" significantly improves LLM accuracy on inferential question answering tasks.
MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering
MedHopQA is a new disease-centered multi-hop reasoning benchmark for evaluating LLMs in biomedical QA, designed to resist saturation and contamination.
Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering
The MedHopQA track benchmarked LLMs on multi-hop medical QA with a new 1,000-pair dataset, highlighting RAG's importance for strong performance.
Reconstruction of Personally Identifiable Information from Supervised Finetuned Models
This paper reveals that PII can be reconstructed from supervised finetuned LLMs, proposing COVA to enhance reconstruction under prefix attacks.
Design Your Ad: Personalized Advertising Image and Text Generation with Unified Autoregressive Models
This paper introduces Uni-AdGen, a unified autoregressive model for personalized image and text ad generation, improving realism and user preference.
World Action Models: The Next Frontier in Embodied AI
This survey introduces World Action Models (WAMs), a new embodied AI paradigm unifying predictive state modeling with action generation, providing a systematic overview.
Caraman at SemEval-2026 Task 8: Three-Stage Multi-Turn Retrieval with Query Rewriting, Hybrid Search, and Cross-Encoder Reranking
This paper presents a three-stage multi-turn retrieval system using query rewriting, hybrid search, and cross-encoder reranking for SemEval-2026 Task 8.
SkillSafetyBench: Evaluating Agent Safety under Skill-Facing Attack Surfaces
SkillSafetyBench evaluates how reusable skills in LLM agents create new attack surfaces, revealing vulnerabilities beyond model-level alignment.
StepCodeReasoner: Aligning Code Reasoning with Stepwise Execution Traces via Reinforcement Learning
StepCodeReasoner uses RL to align code reasoning with stepwise execution traces, achieving SOTA performance by supervising intermediate states.
AgentDisCo: Towards Disentanglement and Collaboration in Open-ended Deep Research Agents
AgentDisCo is a novel agentic architecture that disentangles information exploration and exploitation for deep research, achieving self-refinement and strong performance.
Much of Geospatial Web Search Is Beyond Traditional GIS
This paper reveals that geospatial web search is far more prevalent and practically oriented than previously understood, often exceeding traditional GIS capabilities.
Unlocking LLM Creativity in Science through Analogical Reasoning
Analogical Reasoning (AR) enables LLMs to generate significantly more diverse and novel solutions for scientific problems, mitigating mode collapse.
Decomposing Evolutionary Mixture-of-LoRA Architectures: The Routing Lever, the Lifecycle Penalty, and a Substrate-Conditional Boundary
This paper decomposes an evolutionary Mixture-of-LoRA system, finding that router improvements, not the evolutionary lifecycle, drive performance gains.
ELF: Embedded Language Flows
ELF proposes a continuous diffusion model for language, leveraging flow matching in embedding space to achieve superior generation quality with fewer steps.
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO is a sparse MoE model matching dense performance on end-side devices, offering 3x speedup and reduced storage overhead.
Dynamic Skill Lifecycle Management for Agentic Reinforcement Learning
SLIM dynamically manages external skills for LLM agents in RL, optimizing their active skill set for improved task performance.
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
WildClawBench introduces a new benchmark for evaluating long-horizon, real-world agents using native runtimes and real tools.