Detecting Hallucinations in Large Language Models via Internal Attention Divergence Signals
TLDR
This paper introduces a lightweight, single-pass method using internal attention divergence to detect hallucinations in large language models.
Key contributions
- Proposes a lightweight, single-pass method to detect LLM hallucinations.
- Uses the Kullback-Leibler divergence between each attention head's distribution and a uniform reference distribution as an uncertainty signal (see the sketch after this list).
- Shows attention divergence is highly predictive of answer correctness across various LLMs and tasks.
- Finds the signal is concentrated in middle layers and factual tokens, offering an interpretable uncertainty metric.
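To make the divergence feature concrete, here is a minimal sketch of how a per-head KL divergence from the uniform distribution could be computed from an attention tensor. The function name, tensor shapes, and the choice to average over query positions are illustrative assumptions, not the authors' released code.

```python
import torch

def attention_kl_from_uniform(attn: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Per-head KL divergence of attention rows from a uniform distribution.

    attn: attention weights of shape (num_heads, seq_len, seq_len), where each
    row attn[h, i, :] is a distribution over key positions (sums to 1).
    Returns a tensor of shape (num_heads,): the divergence averaged over
    query positions, one scalar per head (assumed pooling).
    """
    num_keys = attn.shape[-1]
    p = attn.clamp_min(eps)
    # KL(p || U) = sum_j p_j * log(p_j * n), since the uniform density is 1/n.
    kl_per_row = (p * (p * num_keys).log()).sum(dim=-1)  # (num_heads, seq_len)
    return kl_per_row.mean(dim=-1)                       # (num_heads,)
```

Collected over layers and heads, these scalars form the feature vector that the paper feeds to a simple probe.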
Why it matters
Detecting hallucinations in LLMs is crucial for their reliability and trustworthiness. This method offers an efficient, interpretable, and internal way to quantify uncertainty, avoiding costly external models or repeated sampling. It provides a valuable white-box signal for improving LLM safety and accuracy.
Original Abstract
We propose a lightweight and single-pass uncertainty quantification method for detecting hallucinations in Large Language Models. The method uses attention matrices to estimate uncertainty without requiring repeated sampling or external models. Specifically, we measure the Kullback-Leibler divergence between each attention head's distribution and a uniform reference distribution, and use these features in a logistic regression probe. Across multiple datasets, task types, and model families, attention divergence is highly predictive of answer correctness and performs competitively with existing uncertainty estimation methods. We find that this signal is concentrated in middle layers and on factual tokens such as named entities and numbers, suggesting that attention dynamics provides an efficient and interpretable white-box signal of model uncertainty.
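The abstract describes feeding the per-head divergence features to a logistic regression probe that predicts answer correctness. The sketch below illustrates that probing setup with scikit-learn; the feature dimensions, labels, and hyperparameters are placeholders, and the random arrays stand in for features extracted from a real model and a labeled QA dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# X: one row per generated answer; columns are per-(layer, head) KL divergences
# collected as in the sketch above. y: 1 if the answer was correct, 0 otherwise.
# Both arrays are synthetic placeholders for illustration only.
rng = np.random.default_rng(0)
X = rng.random((1000, 32 * 32))          # e.g. 32 layers x 32 heads (assumed)
y = rng.integers(0, 2, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A plain logistic regression probe over the divergence features.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# The probe's probability of correctness can be read as a hallucination score;
# AUROC is a common way to evaluate such uncertainty signals.
scores = probe.predict_proba(X_test)[:, 1]
print("AUROC:", roc_auc_score(y_test, scores))
```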