BibTeX Citation Hallucinations in Scientific Publishing Agents: Evaluation and Mitigation
Delip Rao, Chris Callison-Burch
TLDR
This paper evaluates BibTeX citation hallucinations in search-enabled LLMs, finding that only 50.9% of generated entries are fully correct, and shows that `clibib`, a deterministic retrieval tool, lifts the fully correct rate to 78.3%.
Key contributions
- Evaluated search-enabled LLMs on BibTeX generation using a new 931-paper benchmark across diverse domains.
- Found LLMs achieve 83.6% field accuracy but only 50.9% fully correct BibTeX entries, with accuracy dropping 27.7pp from popular to recent post-cutoff papers.
- Identified two failure modes: wholesale entry substitution and isolated field errors in BibTeX generation.
- Proposed `clibib`, a tool that boosts full correctness to 78.3% by revising LLM outputs against authoritative records.
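The distinction between field accuracy and fully correct entries can be made concrete with a small scoring sketch. This is a hypothetical illustration, not the paper's evaluation code: the nine field names and the normalization rule are assumptions.

```python
# Hypothetical field-level scoring: compare a generated BibTeX entry
# (represented as a field dict) against ground truth, and aggregate both
# per-field accuracy and the fully-correct rate. Field names below are
# illustrative, not the paper's exact set.

SCORED_FIELDS = [
    "title", "author", "year", "venue", "volume",
    "number", "pages", "doi", "url",
]  # nine fields, as in the benchmark

def normalize(value):
    """Crude normalization: lowercase and collapse whitespace."""
    return " ".join(str(value).lower().split())

def score_entry(generated: dict, truth: dict):
    """Return (number of correct fields, whether entry is fully correct)."""
    correct = sum(
        1 for f in SCORED_FIELDS
        if normalize(generated.get(f, "")) == normalize(truth.get(f, ""))
    )
    return correct, correct == len(SCORED_FIELDS)

def aggregate(pairs):
    """Field accuracy and fully-correct rate over (generated, truth) pairs."""
    field_hits, total_fields, fully = 0, 0, 0
    for gen, truth in pairs:
        c, ok = score_entry(gen, truth)
        field_hits += c
        total_fields += len(SCORED_FIELDS)
        fully += ok
    return field_hits / total_fields, fully / len(pairs)
```

Under this kind of metric, an entry with eight of nine fields right boosts field accuracy but contributes nothing to the fully-correct rate, which is why the two numbers in the paper (83.6% vs. 50.9%) diverge so sharply.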
Why it matters
LLMs are widely used in scientific publishing, but their citation errors undermine reliability. This paper provides a crucial benchmark and a practical mitigation tool (`clibib`) to improve the accuracy of automatically generated citations. It highlights the need for robust integration architectures for LLM-based agents.
Original Abstract
Large language models with web search are increasingly used in scientific publishing agents, yet they still produce BibTeX entries with pervasive field-level errors. Prior evaluations tested base models without search, which does not reflect current practice. We construct a benchmark of 931 papers across four scientific domains and three citation tiers -- popular, low-citation, and recent post-cutoff -- designed to disentangle parametric memory from search dependence, with version-aware ground truth accounting for multiple citable versions of the same paper. Three search-enabled frontier models (GPT-5, Claude Sonnet-4.6, Gemini-3 Flash) generate BibTeX entries scored on nine fields and a six-way error taxonomy, producing ~23,000 field-level observations. Overall accuracy is 83.6%, but only 50.9% of entries are fully correct; accuracy drops 27.7pp from popular to recent papers, revealing heavy reliance on parametric memory even when search is available. Field-error co-occurrence analysis identifies two failure modes: wholesale entry substitution (identity fields fail together) and isolated field error. We evaluate clibib, an open-source tool for deterministic BibTeX retrieval from the Zotero Translation Server with CrossRef fallback, as a mitigation mechanism. In a two-stage integration where baseline entries are revised against authoritative records, accuracy rises +8.0pp to 91.5%, fully correct entries rise from 50.9% to 78.3%, and regression rate is only 0.8%. An ablation comparing single-stage and two-stage integration shows that separating search from revision yields larger gains and lower regression (0.8% vs. 4.8%), demonstrating that integration architecture matters independently of model capability. We release the benchmark, error taxonomy, and clibib tool to support evaluation and mitigation of citation hallucinations in LLM-based scientific writing.
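The two-stage integration described in the abstract (search, then revision against authoritative records) can be sketched as follows. This is a minimal sketch under stated assumptions: stage-one retrieval via the Zotero Translation Server or CrossRef is stubbed out, and the revision rule (prefer non-empty authoritative values, keep baseline fields the record lacks) is an assumption about the approach, not `clibib`'s actual logic.

```python
# Hypothetical two-stage revision: given a baseline BibTeX entry from an
# LLM and an authoritative record fetched separately (stage one, stubbed
# here), stage two overwrites baseline fields wherever the authoritative
# record provides a non-empty value, leaving other baseline fields intact.

def revise_entry(baseline: dict, authoritative: dict) -> dict:
    """Stage two: revise a baseline entry against an authoritative record."""
    revised = dict(baseline)
    for field, value in authoritative.items():
        if value:  # only trust fields the authoritative record actually has
            revised[field] = value
    return revised
```

Separating retrieval from revision in this way matches the paper's finding that the integration architecture itself matters: the revision step only ever replaces fields with authoritative values, which keeps the regression rate low.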