ArXiv TLDR

Grounded or Guessing? LVLM Confidence Estimation via Blind-Image Contrastive Ranking

arXiv: 2605.10893

Reza Khanmohammadi, Erfan Miahi, Simerjot Kaur, Charese H. Smiley, Ivan Brugere + 2 more

cs.CL

TLDR

BICR improves LVLM confidence estimation by contrasting real and blind image inputs, detecting visual ungroundedness with high accuracy and efficiency.

Key contributions

  • Introduces BICR, a model-agnostic framework for detecting visual ungroundedness in LVLMs.
  • Trains a lightweight probe using hidden states from real and blacked-out images with a ranking loss.
  • Achieves state-of-the-art calibration and discrimination across diverse LVLMs and tasks.
  • Significantly more parameter-efficient (4-18x fewer parameters) than strong probing baselines, with zero additional inference cost.
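The training setup described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the probe architecture (a single linear head), the hidden dimension, and the `margin` and `lam` hyperparameters are all assumptions; the paper only specifies that a lightweight probe scores the real-image hidden state and that a ranking loss penalizes higher confidence on the blacked-out view.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConfidenceProbe(nn.Module):
    """Lightweight probe mapping a frozen LVLM hidden state to a confidence
    logit. A single linear head is an illustrative choice."""
    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) -> (batch,) raw logits; sigmoid gives confidence
        return self.head(h).squeeze(-1)

def bicr_loss(probe: ConfidenceProbe,
              h_real: torch.Tensor,    # hidden states from real image + question
              h_blind: torch.Tensor,   # hidden states from blacked-out image + same question
              correct: torch.Tensor,   # 1.0 if the model's answer was correct, else 0.0
              margin: float = 0.1,     # assumed margin hyperparameter
              lam: float = 1.0) -> torch.Tensor:  # assumed ranking-loss weight
    """Supervised confidence loss on the real-image view, regularized by a
    margin ranking term that penalizes the probe whenever the blacked-out
    view receives confidence within `margin` of (or above) the real view."""
    s_real = probe(h_real)
    s_blind = probe(h_blind)
    # Calibration term: confidence on the real view should predict correctness.
    sup = F.binary_cross_entropy_with_logits(s_real, correct)
    # Ranking term: real-image confidence should exceed blind confidence.
    rank = F.relu(margin - (s_real - s_blind)).mean()
    return sup + lam * rank
```

At inference only the real-image forward pass and the tiny probe are used, which is why the ranking supervision adds zero inference cost.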

Why it matters

LVLMs often "guess" answers based on language alone, leading to unreliable outputs despite high confidence. This paper addresses a critical flaw in LVLM reliability. BICR provides a robust, efficient way to identify when an LVLM's prediction is truly grounded in visual evidence, making these models more trustworthy for real-world applications.

Original Abstract

Large vision-language models suffer from visual ungroundedness: they can produce a fluent, confident, and even correct response driven entirely by language priors, with the image contributing nothing to the prediction. Existing confidence estimation methods cannot detect this, as they observe model behavior under normal inference with no mechanism to determine whether a prediction was shaped by the image or by text alone. We introduce BICR (Blind-Image Contrastive Ranking), a model-agnostic confidence estimation framework that makes this contrast explicit during training by extracting hidden states from a frozen LVLM twice: once with the real image-question pair, and once with the image blacked out while the question is held fixed. A lightweight probe is trained on the real-image hidden state and regularized by a ranking loss that penalizes higher confidence on the blacked-out view, teaching it to treat visual grounding as a signal of reliability at zero additional inference cost. Evaluated across five modern LVLMs and seven baselines on a benchmark covering visual question answering, object hallucination detection, medical imaging, and financial document understanding, BICR achieves the best cross-LVLM average on both calibration and discrimination simultaneously, with statistically significant discrimination gains robust to cluster-aware analysis at 4-18x fewer parameters than the strongest probing baseline.
