The First Token Knows: Single-Decode Confidence for Hallucination Detection
TLDR
This paper introduces "phi_first," a hallucination detection method that needs only a single greedy decode: it scores the confidence of the first answer token and outperforms multi-sample self-consistency.
Key contributions
- Proposes "phi_first," a low-cost hallucination detection method based on first-token confidence.
- Achieves AUROC of 0.820, outperforming multi-sample self-consistency (0.791) and semantic self-consistency (0.793).
- Uses a single greedy decode, avoiding costly multiple sampling and external NLI overhead.
- Demonstrates that initial token distribution captures much of the uncertainty information.
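The score described above can be sketched in a few lines: take the top-K logits at the first content-bearing answer token, renormalize them with a softmax, and convert the normalized Shannon entropy into a confidence in [0, 1]. This is a minimal illustration, not the authors' code; the choice of K = 10 and renormalizing over only the top-K slice are assumptions based on the abstract's description.

```python
import numpy as np

def phi_first(logits: np.ndarray, k: int = 10) -> float:
    """Confidence from the top-k logits at the first answer token.

    Assumption: 'normalized entropy of the top-K logits' means a
    softmax over the k largest logits, with entropy divided by log(k).
    Returns 1 for a fully peaked distribution, 0 for a uniform one.
    """
    top = np.sort(logits)[-k:]               # k largest logits
    p = np.exp(top - top.max())               # stable softmax
    p /= p.sum()                               # renormalize over top-k
    entropy = -(p * np.log(p)).sum()           # Shannon entropy (nats)
    return float(1.0 - entropy / np.log(k))    # normalized confidence

# A peaked first-token distribution scores high; a flat one scores ~0.
peaked = np.array([10.0, 1.0, 0.5, 0.2, 0.1, 0.0, -1.0, -2.0, -3.0, -4.0])
flat = np.zeros(10)
```

A high phi_first (peaked distribution) would be read as a confident, likely factual answer; a low score flags a possible hallucination without any extra sampling.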
Why it matters
Current hallucination detection methods are costly and complex, requiring multiple decodes and sometimes an external NLI model. This paper offers a significantly simpler and cheaper approach that is at least as effective. It suggests that models reveal their uncertainty very early in generation, providing a new default baseline for future research.
Original Abstract
Self-consistency detects hallucinations by generating multiple sampled answers to a question and measuring agreement, but this requires repeated decoding and can be sensitive to lexical variation. Semantic self-consistency improves this by clustering sampled answers by meaning using natural language inference, but it adds both sampling cost and external inference overhead. We show that first-token confidence, phi_first, computed from the normalized entropy of the top-K logits at the first content-bearing answer token of a single greedy decode, matches or modestly exceeds semantic self-consistency on closed-book short-answer factual question answering. Across three 7-8B instruction-tuned models and two benchmarks, phi_first achieves a mean AUROC of 0.820, compared with 0.793 for semantic agreement and 0.791 for standard surface-form self-consistency. A subsumption test shows that phi_first is moderately to strongly correlated with semantic agreement, and combining the two signals yields only a small AUROC improvement over phi_first alone. These results suggest that much of the uncertainty information captured by multi-sample agreement is already available in the model's initial token distribution. We argue that phi_first should be reported as a default low-cost baseline before invoking sampling-based uncertainty estimation.