Where Does Reasoning Break? Step-Level Hallucination Detection via Hidden-State Transport Geometry
TL;DR
This paper introduces a novel method for detecting step-level hallucinations in LLMs by analyzing hidden-state transport geometry during a single forward pass.
Key contributions
- Detects step-level hallucinations by analyzing hidden-state trajectory dynamics in a single forward pass.
- Proposes a label-conditioned teacher that builds a trace-specific contrastive PCA lens and scores each step with seven geometric transition features.
- Develops a distilled BiLSTM student model that operates on raw hidden states without inference-time labels.
- Outperforms entropy-based, probing-based, and attention-based baselines in-domain, while exposing transferability challenges under distribution shift.
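The contrastive PCA lens in the contributions above can be illustrated with a minimal sketch. This is not the paper's implementation: it uses standard contrastive PCA (top eigenvectors of the difference of covariances between "error" and "correct" hidden states), and the function name, `alpha` weight, and toy data are illustrative assumptions.

```python
import numpy as np

def contrastive_pca(X_err, X_ok, k=2, alpha=1.0):
    """Illustrative contrastive PCA: find directions along which the
    'error' hidden states vary more than the 'correct' ones.
    X_err, X_ok: (n, d) arrays of hidden states; returns a (d, k) basis."""
    C_err = np.cov(X_err, rowvar=False)
    C_ok = np.cov(X_ok, rowvar=False)
    # Eigendecompose the contrastive covariance; eigh returns
    # eigenvalues in ascending order, so take the top-k from the end.
    evals, evecs = np.linalg.eigh(C_err - alpha * C_ok)
    top = np.argsort(evals)[::-1][:k]
    return evecs[:, top]
```

Projecting each step's hidden state onto this basis gives the low-dimensional "lens" in which the teacher's geometric transition features are computed.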
Why it matters
This work shifts hallucination detection from trace-level to step-level, enabling precise error localization. It introduces a novel geometric perspective on hidden-state trajectories and reveals critical challenges for deploying such detectors under distribution shift.
Original Abstract
Large language models hallucinate during multi-step reasoning, but most existing detectors operate at the trace level: they assign one confidence score to a full output, fail to localize the first error, and often require multiple sampled completions. We frame hallucination instead as a property of the hidden-state trajectory produced during a single forward pass. Correct reasoning moves through a stable manifold of locally coherent transitions; a first error appears as a localized excursion in transport cost away from this manifold. We operationalize this view with a label-conditioned teacher that builds a trace-specific contrastive PCA lens and scores each step with seven geometric transition features, and a deployable BiLSTM student distilled from the teacher that operates on raw hidden states without inference-time labels. We prove that contrastive PCA is the optimal projection for a transport-separation objective between first-error and correct states, and that single-pass first-error localization holds whenever the first error creates a positive transport margin over preceding correct transitions. On ProcessBench, PRM800K, HaluEval, and TruthfulQA, both models outperform entropy-based, probing-based, and attention-based baselines in-domain; the teacher transfers stably across language models and datasets, while the student collapses under shift, a gap our distillation theory predicts. These results recast step-level hallucination detection as a problem of trajectory dynamics and identify the central obstacle to deployment: preserving the contrastive transport margin under distribution shift.
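The abstract's localization claim — the first error is detectable whenever it creates a positive transport margin over preceding correct transitions — can be sketched as a simple decision rule. This is a toy stand-in, not the paper's method: it uses Euclidean step length as the transport cost (the paper uses seven geometric features), and the function name and `margin` parameter are assumptions for illustration.

```python
import numpy as np

def first_error_step(H, margin=0.0):
    """Flag the first step whose incoming transition cost exceeds every
    preceding transition by `margin`. H: (T, d) array, one hidden state
    per reasoning step. Euclidean step length proxies transport cost."""
    costs = np.linalg.norm(np.diff(H, axis=0), axis=1)  # (T-1,) costs
    for t in range(1, len(costs)):
        if costs[t] > costs[:t].max() + margin:
            return t + 1  # 0-indexed step whose incoming transition spiked
    return None  # no excursion above the margin: trace looks correct
```

The rule is single-pass by construction: it only compares each transition against the transitions before it, so no sampled completions or labels are needed at inference time.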