VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning
Wenyi Xiao, Xinchi Xu, Leilei Gan
TLDR
VL-Calibration decouples confidence in LVLMs into visual and reasoning components, improving calibration and reducing hallucinations for reliable multimodal AI.
Key contributions
- Proposes VL-Calibration, an RL framework for decoupled visual and reasoning confidence in LVLMs.
- Introduces intrinsic visual certainty estimation via KL-divergence under perturbations and token entropy.
- Uses token-level advantage reweighting to suppress ungrounded hallucinations and preserve valid perception.
- Achieves improved calibration and boosts visual reasoning accuracy across diverse benchmarks.
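The intrinsic visual certainty estimation mentioned above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the exponential mapping of KL to a [0, 1] grounding score, and the `alpha` mixing weight are all assumptions; the paper only specifies that KL-divergence under image perturbations and token entropy are combined.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) per token, summed over the vocabulary.
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def token_entropy(p, eps=1e-12):
    # Shannon entropy per token, in nats.
    return -np.sum(p * np.log(p + eps), axis=-1)

def visual_certainty(logits_clean, logits_perturbed, alpha=0.5):
    """Hypothetical per-token visual certainty in [0, 1]: high when the
    token distribution is stable under image perturbation (low KL,
    i.e. visually grounded) and peaked (low entropy, i.e. internally
    certain). `alpha` trades off the two signals."""
    p = softmax(logits_clean)
    q = softmax(logits_perturbed)
    vocab = logits_clean.shape[-1]
    grounding = np.exp(-kl_divergence(p, q))            # 1 when distributions match
    certainty = 1.0 - token_entropy(p) / np.log(vocab)  # 1 when one-hot
    return alpha * grounding + (1.0 - alpha) * certainty
```

In practice the perturbed logits would come from a second forward pass on a blurred, masked, or otherwise corrupted image; a token whose distribution shifts sharply under perturbation is likely driven by language priors rather than the image.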
Why it matters
LVLMs often hallucinate with high certainty, limiting their use in critical applications. VL-Calibration addresses this by decoupling confidence, leading to more reliable and accurate multimodal reasoning. This enhances trust and broadens LVLM applicability.
Original Abstract
Large Vision Language Models (LVLMs) achieve strong multimodal reasoning but frequently exhibit hallucinations and incorrect responses with high certainty, which hinders their usage in high-stakes domains. Existing verbalized confidence calibration methods, largely developed for text-only LLMs, typically optimize a single holistic confidence score using binary answer-level correctness. This design is mismatched to LVLMs: an incorrect prediction may arise from perceptual failures or from reasoning errors given correct perception, and a single confidence conflates these sources while visual uncertainty is often dominated by language priors. To address these issues, we propose VL-Calibration, a reinforcement learning framework that explicitly decouples confidence into visual and reasoning confidence. To supervise visual confidence without ground-truth perception labels, we introduce an intrinsic visual certainty estimation that combines (i) visual grounding measured by KL-divergence under image perturbations and (ii) internal certainty measured by token entropy. We further propose token-level advantage reweighting to focus optimization on tokens based on visual certainty, suppressing ungrounded hallucinations while preserving valid perception. Experiments on thirteen benchmarks show that VL-Calibration effectively improves calibration while boosting visual reasoning accuracy, and it generalizes to out-of-distribution benchmarks across model scales and architectures.
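The token-level advantage reweighting described in the abstract could take a form like the sketch below. This is an illustrative guess at the mechanism, not the paper's formula: the asymmetric treatment of positive vs. negative advantages and the `floor` parameter are assumptions introduced here.

```python
import numpy as np

def reweight_advantages(advantages, certainty, floor=0.1):
    """Hypothetical token-level advantage reweighting: tokens with low
    visual certainty that are being reinforced (positive advantage)
    have their update shrunk, so ungrounded hallucinations are not
    rewarded; well-grounded tokens keep their full signal, and
    penalties (negative advantages) pass through unchanged."""
    w = np.clip(certainty, floor, 1.0)
    return np.where(advantages > 0, advantages * w, advantages)
```

The intent matches the contribution bullet above: optimization pressure is concentrated on tokens according to their visual certainty, suppressing ungrounded generations while preserving valid perception.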