ArXiv TLDR

Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents

arXiv: 2605.06635

Hailey Onweller, Elias Lumer, Austin Huber, Pia Ramchandani, Vamse Kumar Subbiah + 1 more

cs.CL

TLDR

This paper introduces a framework to evaluate LLM research agents' source attribution, revealing high link validity but low factual accuracy in citations.

Key contributions

  • Developed a novel framework using an AST parser to evaluate LLM citation quality at scale (a parsing sketch follows this list).
  • Evaluated citations along three dimensions: link accessibility, content relevance, and factual accuracy.
  • Benchmarked 14 LLMs, finding high link validity (>94%) and relevance (>80%) but low factual accuracy (39-77%).
  • Showed factual accuracy drops ~42% for frontier models as retrieval depth increases, highlighting a critical disconnect.
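
The AST-based citation parser itself is not reproduced in this summary, but the core step is easy to picture: walk the syntax tree of an LLM-generated Markdown report and record each inline link together with the text it is attached to. Below is a minimal sketch, assuming markdown-it-py as the parsing backend and a simple (claim, link text, URL) record per citation; the paper's actual parser and data model may differ.

```python
from dataclasses import dataclass

from markdown_it import MarkdownIt


@dataclass
class Citation:
    claim: str      # text of the block the citation appears in
    link_text: str  # anchor text of the inline citation
    url: str        # cited URL, to be fetched and judged later


def extract_citations(markdown_report: str) -> list[Citation]:
    """Collect every inline link in the report, paired with its surrounding text."""
    md = MarkdownIt()
    citations: list[Citation] = []
    for block in md.parse(markdown_report):
        # Only "inline" tokens carry the link_open / text / link_close children.
        if block.type != "inline" or not block.children:
            continue
        current_url, current_text = None, []
        for tok in block.children:
            if tok.type == "link_open":
                current_url, current_text = tok.attrGet("href"), []
            elif tok.type == "text" and current_url is not None:
                current_text.append(tok.content)
            elif tok.type == "link_close" and current_url is not None:
                citations.append(
                    Citation(block.content, "".join(current_text), current_url)
                )
                current_url = None
    return citations
```

Each extracted record can then be handed to the link-accessibility, relevance, and fact-check evaluators described in the abstract below.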

Why it matters

LLM deep research agents are increasingly used, but their cited information is often factually incorrect despite appearing valid. This framework provides crucial tools to assess and improve the reliability of LLM-generated research, addressing a critical gap in current evaluation methods.

Original Abstract

Large language models (LLMs) power deep research agents that synthesize information from hundreds of web sources into cited reports, yet these citations cannot be reliably verified. Current approaches either trust models to self-cite accurately, risking bias, or employ retrieval-augmented generation (RAG) that does not validate source accessibility, relevance, or factual consistency. We introduce the first source attribution evaluation framework that uses a reproducible AST parser to extract and evaluate inline citations from LLM-generated Markdown reports at scale. Unlike methods that verify claims in isolation, our framework closes the loop by retrieving the actual cited content, enabling human or model evaluators to judge each citation against its source. Citations are evaluated along three dimensions. (1) Link Works verifies URL accessibility, (2) Relevant Content measures topical alignment, and (3) Fact Check validates factual accuracy against source content. We benchmark 14 closed-source and open-source LLMs across three evaluation dimensions using rubric-based LLM-as-a-judge evaluators calibrated through human review. Our results reveal that even the strongest frontier models maintain link validity above 94% and relevance above 80%, yet achieve only 39-77% factual accuracy, while fewer than half of open-source models successfully generate cited reports in a one-shot setting. Ablation studies on research depth show that Fact Check accuracy drops by approximately 42% on average across two frontier models as tool calls scale from 2 to 150, demonstrating that more retrieval does not produce more accurate citations. These findings reveal a critical disconnect between surface-level citation quality and factual reliability, and our framework provides the evaluation infrastructure to assess the disconnect.
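
The three dimensions map onto a small evaluation loop: check that the cited URL resolves, fetch its content, and ask a rubric-based judge whether the source supports the claim. Here is a minimal sketch of the Link Works and Fact Check steps; the requests library and the `judge` callable (standing in for whatever LLM-as-a-judge backend and rubric the authors use) are assumptions, not the paper's implementation.

```python
from typing import Callable

import requests


def link_works(url: str, timeout: float = 10.0) -> bool:
    """Link Works: does the cited URL resolve to a successful HTTP response?"""
    try:
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        if resp.status_code >= 400:  # some servers reject HEAD requests
            resp = requests.get(url, timeout=timeout, allow_redirects=True)
        return resp.status_code < 400
    except requests.RequestException:
        return False


def fact_check(claim: str, source_text: str, judge: Callable[[str], str]) -> bool:
    """Fact Check: ask an LLM judge whether the retrieved source supports the claim."""
    prompt = (
        "You are grading a citation.\n"
        f"Claim from the report:\n{claim}\n\n"
        f"Content of the cited source:\n{source_text[:8000]}\n\n"
        "Answer SUPPORTED or UNSUPPORTED."
    )
    return judge(prompt).strip().upper().startswith("SUPPORTED")
```

In the paper the judge is a rubric-based LLM evaluator calibrated through human review; any chat-completion client wrapped as a str-to-str callable could fill that role in this sketch.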
