ArXiv TLDR

Detecting Hallucinations in Large Language Models via Internal Attention Divergence Signals

arXiv:2605.05025

Gijs van Dijk

cs.CL

TLDR

This paper introduces a lightweight, single-pass method using internal attention divergence to detect hallucinations in large language models.

Key contributions

  • Proposes a lightweight, single-pass method to detect LLM hallucinations.
  • Uses the Kullback-Leibler divergence between each attention head's distribution and a uniform reference as an uncertainty signal.
  • Shows attention divergence is highly predictive of answer correctness across various LLMs and tasks.
  • Finds the signal is concentrated in middle layers and factual tokens, offering an interpretable uncertainty metric.
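
The per-head divergence feature in the second bullet can be sketched in a few lines. For a distribution p over n positions, KL(p ‖ U) = log n − H(p), so the feature is just log sequence length minus attention entropy. This is a minimal illustration, not the authors' code; the array layout (one attention distribution per head for a single query position) is an assumption:

```python
import numpy as np

def attention_kl_from_uniform(attn):
    """KL divergence of each attention head's distribution from uniform.

    attn: array of shape (num_heads, seq_len) -- one attention
    distribution per head for a single query position (assumed layout).
    Returns: array of shape (num_heads,) with KL(p || U) per head.
    """
    attn = np.asarray(attn, dtype=float)
    n = attn.shape[-1]
    p = attn / attn.sum(axis=-1, keepdims=True)  # renormalize defensively
    # KL(p || U) = log n - H(p); clip to avoid log(0)
    entropy = -np.sum(p * np.log(np.clip(p, 1e-12, None)), axis=-1)
    return np.log(n) - entropy
```

A perfectly uniform head scores 0, and a head that attends to a single token scores log n, so higher values mean sharper (more confident-looking) attention.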

Why it matters

Detecting hallucinations in LLMs is crucial for their reliability and trustworthiness. This method offers an efficient, interpretable, and internal way to quantify uncertainty, avoiding costly external models or repeated sampling. It provides a valuable white-box signal for improving LLM safety and accuracy.

Original Abstract

We propose a lightweight and single-pass uncertainty quantification method for detecting hallucinations in Large Language Models. The method uses attention matrices to estimate uncertainty without requiring repeated sampling or external models. Specifically, we measure the Kullback-Leibler divergence between each attention head's distribution and a uniform reference distribution, and use these features in a logistic regression probe. Across multiple datasets, task types, and model families, attention divergence is highly predictive of answer correctness and performs competitively with existing uncertainty estimation methods. We find that this signal is concentrated in middle layers and on factual tokens such as named entities and numbers, suggesting that attention dynamics provides an efficient and interpretable white-box signal of model uncertainty.
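
As a rough sketch of the probe stage described in the abstract, a logistic regression over per-head divergence features might look like the following. Plain NumPy gradient descent stands in for whatever solver the authors used, and all names, shapes, and labels here are hypothetical:

```python
import numpy as np

def fit_logistic_probe(X, y, lr=0.1, steps=500):
    """Fit a logistic-regression probe y ~ sigmoid(X @ w + b) by gradient descent.

    X: (n_samples, n_features) matrix of per-head KL-divergence features.
    y: (n_samples,) binary labels (1 = answer correct).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(correct)
        grad = p - y                            # gradient of mean log-loss
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_predict(X, w, b):
    """Threshold the probe's probability at 0.5 to predict correctness."""
    p = 1.0 / (1.0 + np.exp(-(np.asarray(X, dtype=float) @ w + b)))
    return (p > 0.5).astype(int)
```

In the paper's setup, each row of X would hold one divergence value per attention head (across all layers) for a generated answer, and y would mark whether that answer was judged correct.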

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.