Adaptive Conformal Prediction for Improving Factuality of Generations by Large Language Models
Aleksandr Rubashevskii, Dzianis Piatrashyn, Preslav Nakov, Maxim Panov
TLDR
An adaptive conformal prediction method improves LLM factuality through prompt-dependent calibration, preserving marginal coverage guarantees while improving conditional coverage.
Key contributions
- Proposes adaptive conformal prediction (ACP) for LLMs to address factual errors.
- Extends conformal score transformation methods to LLMs, enabling prompt-dependent calibration.
- Maintains marginal coverage while significantly improving conditional coverage.
- Supports selective prediction to filter unreliable LLM claims or answer choices.
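To make the selective-prediction idea concrete, here is a minimal sketch of split conformal calibration for filtering generated claims, in the spirit of the non-adaptive baselines the paper builds on. The per-claim confidence scores, labels, and function names are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: split conformal calibration for selective claim filtering.
# Assumes each calibration prompt comes with per-claim confidence scores from
# the LLM and binary factuality labels; `conf`, `is_true`, and the function
# names are illustrative, not from the paper.
import numpy as np

def calibrate_threshold(cal_examples, alpha=0.1):
    """Return a confidence cutoff tau such that, for roughly a 1 - alpha
    fraction of new prompts, every retained claim (conf > tau) is factual."""
    scores = []
    for claims in cal_examples:            # claims: list of (conf, is_true)
        false_confs = [c for c, ok in claims if not ok]
        # Nonconformity score: the smallest cutoff that removes every false
        # claim for this prompt (0 if the generation was already factual).
        scores.append(max(false_confs) if false_confs else 0.0)
    n = len(scores)
    q = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
    return float(np.quantile(scores, q, method="higher"))

def filter_claims(claims, tau):
    """Keep only claims whose confidence exceeds the calibrated cutoff."""
    return [(c, ok) for c, ok in claims if c > tau]
```

Because a single cutoff is shared across all prompts, this baseline can over-filter easy prompts and under-filter hard ones; that is the gap the prompt-adaptive calibration targets.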
Why it matters
LLMs often generate incorrect information, limiting their trustworthiness in critical applications. This method offers a principled way to quantify and improve the factual reliability of LLM outputs. By adapting calibration to individual prompts, it produces uncertainty estimates that better reflect input-dependent variability, making LLM outputs safer and more dependable for downstream use.
Original Abstract
Large language models (LLMs) are prone to generating factually incorrect outputs. Recent work has applied conformal prediction to provide uncertainty estimates and statistical guarantees for the factuality of LLM generations. However, existing approaches are typically not prompt-adaptive, limiting their ability to capture input-dependent variability. As a result, they may filter out too few items (leading to over-coverage) or too many (under-coverage) for a given task or prompt. We propose an adaptive conformal prediction approach that extends conformal score transformation methods to LLMs, with applications to long-form generation and multiple-choice question answering. This enables prompt-dependent calibration, retaining marginal coverage guarantees while improving conditional coverage. In addition, the approach naturally supports selective prediction, allowing unreliable claims or answer choices to be filtered out in downstream applications. We evaluate our approach on multiple white-box models across diverse domains and show that it significantly outperforms existing baselines in terms of conditional coverage.
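The abstract does not spell out the exact score transformation, but the prompt-adaptive calibration can be sketched as follows: raw conformal scores are rescaled by a per-prompt difficulty estimate before the quantile is computed, so the resulting cutoff varies with the prompt while the marginal guarantee is unchanged. The dividing-by-difficulty transform and all names below are illustrative assumptions, not the paper's exact construction.

```python
# Minimal sketch of prompt-adaptive calibration via a conformal score
# transformation. The specific transform (dividing by a predicted
# per-prompt difficulty) is an assumption for illustration.
import numpy as np

def adaptive_threshold(cal_scores, cal_difficulty, alpha=0.1):
    """Calibrate on transformed scores s_i / g(x_i); marginal coverage is
    preserved because the same transform is applied at test time."""
    cal_scores = np.asarray(cal_scores, dtype=float)
    cal_difficulty = np.asarray(cal_difficulty, dtype=float)
    adjusted = cal_scores / np.maximum(cal_difficulty, 1e-8)
    n = len(adjusted)
    q = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(adjusted, q, method="higher"))

def prompt_cutoff(tau_adj, test_difficulty):
    """Map the shared quantile back to a prompt-specific confidence cutoff."""
    return tau_adj * max(test_difficulty, 1e-8)
```

In practice the difficulty estimate could come from the model's own uncertainty or an auxiliary predictor; the key property is that the same transform is applied identically at calibration and test time, which keeps the marginal guarantee while letting the effective cutoff adapt to each prompt.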