LLM-Oriented Information Retrieval: A Denoising-First Perspective

May 1, 2026 · arXiv:2605.00505

Lu Dai, Liang Sun, Fanpu Cao, Ziyang Rao, Cehao Yang + 2 more

cs.IR, cs.AI, cs.CL

TLDR

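This perspective paper argues that denoising is becoming the primary bottleneck in LLM-oriented information retrieval, and it proposes a four-stage framework of IR challenges along with a pipeline-organized taxonomy of signal-to-noise optimization techniques.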

Key contributions

Identifies denoising as the primary bottleneck for LLM-oriented IR due to LLMs' noise vulnerability.
Introduces a four-stage framework for IR challenges: inaccessible, undiscoverable, misaligned, unverifiable.
Presents a pipeline-organized taxonomy of signal-to-noise optimization techniques for LLM-IR.
Surveys denoising research in retrieval-heavy domains: lifelong assistants, coding agents, deep research, and multimodal understanding.

Why it matters

LLMs increasingly consume information retrieval output through RAG and agentic search. The authors argue that misleading or irrelevant retrieved content is a direct cause of hallucinations and reasoning failures, which makes denoising a central challenge. The paper offers a framework for the problem and a taxonomy of techniques spanning indexing, retrieval, context engineering, verification, and agentic workflow.

Original Abstract

Modern information retrieval (IR) is no longer consumed primarily by humans but increasingly by large language models (LLMs) via retrieval-augmented generation (RAG) and agentic search. Unlike human users, LLMs are constrained by limited attention budgets and are uniquely vulnerable to noise; misleading or irrelevant information is no longer just a nuisance, but a direct cause of hallucinations and reasoning failures. In this perspective paper, we argue that denoising (maximizing usable evidence density and verifiability within a context window) is becoming the primary bottleneck across the full information access pipeline. We conceptualize this paradigm shift through a four-stage framework of IR challenges: from inaccessible to undiscoverable, to misaligned, and finally to unverifiable. Furthermore, we provide a pipeline-organized taxonomy of signal-to-noise optimization techniques, spanning indexing, retrieval, context engineering, verification, and agentic workflow. We also present research works on information denoising in domains that rely heavily on retrieval such as lifelong assistant, coding agent, deep research, and multimodal understanding.
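To make the abstract's notion of "usable evidence density within a context window" concrete, here is a minimal illustrative sketch. It is not taken from the paper. The `Passage` type, the `relevance` score, the whitespace token count, the `min_relevance` threshold, and the `evidence_density` formula are all assumptions introduced for illustration; the paper may define and optimize density quite differently.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    text: str
    relevance: float  # hypothetical score in [0, 1] from any retriever or reranker


def count_tokens(text: str) -> int:
    # Crude whitespace count (assumption); a real system would use the model's tokenizer.
    return len(text.split())


def denoise_context(passages, token_budget, min_relevance=0.5):
    """Keep the most relevant passages that fit the budget; drop likely noise."""
    kept, used = [], 0
    # Visit highest-relevance passages first so the budget goes to the strongest evidence.
    for p in sorted(passages, key=lambda p: p.relevance, reverse=True):
        if p.relevance < min_relevance:
            break  # all remaining passages fall below the noise threshold
        cost = count_tokens(p.text)
        if used + cost <= token_budget:
            kept.append(p)
            used += cost
    return kept


def evidence_density(passages):
    """Relevance-weighted tokens divided by total tokens (illustrative definition)."""
    total = sum(count_tokens(p.text) for p in passages)
    if total == 0:
        return 0.0
    weighted = sum(p.relevance * count_tokens(p.text) for p in passages)
    return weighted / total


if __name__ == "__main__":
    retrieved = [
        Passage("alpha beta gamma", 0.9),
        Passage("noise words here now", 0.1),
        Passage("delta epsilon", 0.7),
    ]
    context = denoise_context(retrieved, token_budget=5)
    print([p.text for p in context])
    print(evidence_density(retrieved), evidence_density(context))
```

Under these assumptions, filtering trades recall for density: the low-relevance passage is dropped, so a larger share of the tokens the model attends to carries usable evidence.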
