Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs
Abinitha Gourabathina, Inkit Padhi, Manish Nagireddy, Subhajit Chaudhury, Prasanna Sattigeri
TLDR
Trace Inversion helps LLMs know when to abstain by inverting reasoning traces to detect if they answered the wrong question.
Key contributions
- Proposes the Query Misalignment Framework, reinterpreting hallucinations as answering the wrong question.
- Introduces Trace Inversion, a new class of state-of-the-art abstention methods for LLMs.
- Reconstructs the most likely query from an LLM's reasoning trace and compares it to the original.
- Flags the model to abstain when the original and reconstructed queries show low similarity, boosting reliability.
Why it matters
For reliable deployment, LLMs must know when to abstain, a challenge that is particularly acute for reasoning models. The paper's Trace Inversion method significantly improves abstention performance across frontier LLMs and datasets, enhancing their trustworthiness and safety.
Original Abstract
For Large Language Models (LLMs) to be reliably deployed, models must effectively know when not to answer: abstain. Reasoning models, in particular, have gained attention for impressive performance on complex tasks. However, reasoning models have been shown to have worse abstention abilities. Taking the vulnerabilities of reasoning models into account, we propose our Query Misalignment Framework. Hallucinations resulting in failed abstention can be reinterpreted as LLMs answering the wrong question (rather than answering a question incorrectly). Based on this framework, we develop a new class of state-of-the-art abstention methods called Trace Inversion. First, we generate the reasoning trace of a model. Based on only the trace, we then reconstruct the most likely query that the model responded to. Finally, we compare the initial query with the reconstructed query. A low similarity score between the initial query and the reconstructed query suggests that the model likely answered the question incorrectly and is flagged to abstain. Extensive experiments demonstrate that Trace Inversion effectively boosts abstention performance in four frontier LLMs across nine abstention QA datasets, beating competitive baselines in 33 out of 36 settings.
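Illustrative sketch
The snippet below is a minimal sketch of the three-step Trace Inversion loop described in the abstract, not the authors' implementation. It assumes generic `generate` and `embed` callables, cosine similarity over sentence embeddings, and an arbitrary threshold of 0.7; the paper's actual prompts, similarity measure, and abstention threshold are not given in this digest.

```python
# Hypothetical sketch of a Trace Inversion abstention check.
# Assumptions: `generate(prompt)` calls the target LLM and returns text;
# `embed(text)` returns a sentence-embedding vector; the 0.7 threshold
# and the prompts are placeholders, not the paper's settings.
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def should_abstain(query: str, generate, embed, threshold: float = 0.7) -> bool:
    """Flag abstention when the reconstructed query drifts from the original."""
    # 1. Generate the model's reasoning trace for the original query.
    trace = generate(
        f"Answer the question. Think step by step.\n\nQuestion: {query}"
    )

    # 2. From the trace alone (original query withheld), reconstruct the
    #    question the model most likely answered.
    reconstructed = generate(
        "Below is a model's reasoning trace. "
        "State the single question it is most likely answering.\n\n"
        f"Reasoning trace:\n{trace}"
    )

    # 3. Compare original and reconstructed queries; low similarity
    #    suggests the model answered the wrong question, so abstain.
    similarity = cosine_similarity(embed(query), embed(reconstructed))
    return similarity < threshold
```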