ArXiv TLDR

Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Research Agents

2604.03173

Delip Rao, Eric Wong, Chris Callison-Burch

cs.CL

TLDR

This paper measures and corrects citation hallucinations in LLMs and research agents, finding 3-13% of URLs are fabricated.

Key contributions

  • Systematically measured citation URL validity across 10 LLMs/agents using over 200k URLs.
  • Found 3-13% of citations are hallucinated (never existed) and 5-18% are non-resolving.
  • Deep research agents generate more citations per query but hallucinate URLs at higher rates; domain effects and model-specific failure modes are identified.
  • Released `urlhealth`, an open-source tool that reduces non-resolving citations by 6-79x via self-correction.

Why it matters

This paper offers the first systematic measurement of citation reliability in LLMs and research agents, revealing a significant hallucination problem. Its open-source tool, `urlhealth`, provides a practical remedy: models equipped with it can self-correct and drastically improve citation accuracy, which is crucial for trustworthy AI.
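The core idea behind `urlhealth` — checking whether a URL is live and, if not, using the Wayback Machine to separate stale links (link rot) from fabricated ones — can be sketched in a few lines. This is not the paper's actual implementation; the function names and user agent string are illustrative, and it uses the public Wayback Machine availability API:

```python
import json
import urllib.parse
import urllib.request


def resolves(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        req = urllib.request.Request(
            url, method="HEAD", headers={"User-Agent": "urlcheck-sketch/0.1"}
        )
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status < 400
    except Exception:
        return False


def has_wayback_snapshot(url: str, timeout: float = 10.0) -> bool:
    """Ask the Wayback Machine availability API for any archived snapshot."""
    api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(
        url, safe=""
    )
    try:
        with urllib.request.urlopen(api, timeout=timeout) as resp:
            data = json.load(resp)
        # An empty "archived_snapshots" dict means no snapshot exists.
        return bool(data.get("archived_snapshots"))
    except Exception:
        return False


def classify(live: bool, archived: bool) -> str:
    """Map the two checks onto the paper's failure taxonomy:
    live -> valid; dead but archived -> stale (link rot);
    dead with no archive record -> likely hallucinated."""
    if live:
        return "valid"
    return "stale" if archived else "hallucinated"


def check_citation(url: str) -> str:
    """End-to-end check for a single citation URL."""
    live = resolves(url)
    archived = False if live else has_wayback_snapshot(url)
    return classify(live, archived)
```

In an agentic self-correction loop, a model would run something like `check_citation` on each URL it emits and retry or replace any citation not classified as `valid`.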

Original Abstract

Large language models and deep research agents supply citation URLs to support their claims, yet the reliability of these citations has not been systematically measured. We address six research questions about citation URL validity using 10 models and agents on DRBench (53,090 URLs) and 3 models on ExpertQA (168,021 URLs across 32 academic fields). We find that 3--13\% of citation URLs are hallucinated -- they have no record in the Wayback Machine and likely never existed -- while 5--18\% are non-resolving overall. Deep research agents generate substantially more citations per query than search-augmented LLMs but hallucinate URLs at higher rates. Domain effects are pronounced: non-resolving rates range from 5.4\% (Business) to 11.4\% (Theology), with per-model effects even larger. Decomposing failures reveals that some models fabricate every non-resolving URL, while others show substantial link-rot fractions indicating genuine retrieval. As a solution, we release urlhealth, an open-source tool for URL liveness checking and stale-vs-hallucinated classification using the Wayback Machine. In agentic self-correction experiments, models equipped with urlhealth reduce non-resolving citation URLs by $6\textrm{--}79\times$ to under 1\%, though effectiveness depends on the model's tool-use competence. The tool and all data are publicly available. Our characterization findings, failure taxonomy, and open-source tooling establish that citation URL validity is both measurable at scale and correctable in practice.
