Scalable Token-Level Hallucination Detection in Large Language Models
Rui Min, Tianyu Pang, Chao Du, Minhao Cheng, Yi R. Fung
TLDR
TokenHD is a scalable pipeline for training token-level hallucination detectors for LLM outputs; even a 0.6B detector trained with it outperforms much larger reasoning models at catching reasoning errors.
Key contributions
- Introduces TokenHD, a holistic pipeline for token-level hallucination detection in LLMs.
- Features a scalable data engine for synthesizing large-scale hallucination annotations and an importance-weighted training strategy (sketched after this list).
- Detects hallucinations directly in free-form text, eliminating the need for step segmentation.
- A small TokenHD detector (0.6B) outperforms much larger reasoning models (e.g., QwQ-32B).
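The digest does not include code, but an importance-weighted training strategy of this kind typically reduces to a weighted per-token classification loss. Below is a minimal PyTorch sketch under that assumption; the tensor shapes and the weighting scheme (e.g., up-weighting the rare hallucinated tokens) are illustrative placeholders, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def importance_weighted_token_loss(logits, labels, weights, mask):
    """Per-token binary hallucination loss with importance weights.

    logits:  (batch, seq_len) raw detector scores, one per token
    labels:  (batch, seq_len) 1.0 = hallucinated token, 0.0 = faithful
    weights: (batch, seq_len) importance weights (hypothetical scheme,
             e.g., up-weighting the rare hallucinated tokens)
    mask:    (batch, seq_len) 1.0 for real tokens, 0.0 for padding
    """
    per_token = F.binary_cross_entropy_with_logits(
        logits, labels, reduction="none"
    )
    weighted = per_token * weights * mask
    # Normalize by the number of real tokens so padding does not
    # dilute the loss across variable-length sequences.
    return weighted.sum() / mask.sum().clamp(min=1.0)
```

Weighting the loss per token is one plausible instantiation of the paper's "importance-weighted strategy"; the authors' actual weight definition may differ.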
Why it matters
LLMs frequently hallucinate, particularly in complex reasoning, and existing step-level detection methods suffer from coarse granularity and poor scalability because they depend on step segmentation. TokenHD offers a scalable, token-level alternative that significantly improves detection accuracy and efficiency, even with small models. This enhances LLM reliability, which is crucial for trustworthy AI applications.
Original Abstract
Large language models (LLMs) have demonstrated remarkable capabilities, but they still frequently produce hallucinations. These hallucinations are difficult to detect in reasoning-intensive tasks, where the content appears coherent but contains errors like logical flaws and unreliable intermediate results. While step-level analysis is commonly used to detect internal hallucinations, it suffers from limited granularity and poor scalability due to its reliance on step segmentation. To address these limitations, we propose TokenHD, a holistic pipeline for training token-level hallucination detectors. Specifically, TokenHD consists of a scalable data engine for synthesizing large-scale hallucination annotations along with a training recipe featuring an importance-weighted strategy for robust model training. To systematically assess the detection performance, we also provide a rigorous evaluation protocol. Through training within TokenHD, our detector operates directly on free-form text to identify hallucinations, eliminating the need for predefined step segmentation or additional text reformatting. Our experiments show that even a small detector (0.6B) achieves substantial performance gains after training, surpassing much larger reasoning models (e.g., QwQ-32B), and detection performance scales consistently with model size from 0.6B to 8B. Finally, we show that our detector can generalize well across diverse practical scenarios and explore strategies to further enhance its cross-domain generalization capability.
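Because the detector operates directly on free-form text, inference can be framed as ordinary token classification. The sketch below illustrates this with a standard Hugging Face interface; the checkpoint ID, label index, and threshold are hypothetical placeholders for illustration, not artifacts released with the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical checkpoint name; the digest does not specify a released model ID.
MODEL_ID = "your-org/tokenhd-0.6b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)

text = "The square root of 144 is 13, so the answer is 13."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, num_labels)

# Assume label index 1 means "hallucinated"; flag tokens above a threshold.
probs = logits.softmax(dim=-1)[0, :, 1]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
flagged = [t for t, p in zip(tokens, probs.tolist()) if p > 0.5]
print(flagged)
```

Note that no step segmentation or reformatting happens before inference; the raw text is tokenized and scored token by token, which is the property the paper highlights.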