ArXiv TLDR

Predictive Entropy Links Calibration and Paraphrase Sensitivity in Medical Vision-Language Models

arXiv: 2604.08941

Binesh Sadanandan, Vahid Behzadan

cs.LG

TLDR

A single predictive-entropy score flags both miscalibration and paraphrase sensitivity in medical Vision-Language Models, revealing that the two failure modes share a common cause.

Key contributions

  • Predictive entropy from a single forward pass flags both miscalibrated and rephrase-sensitive predictions in medical VLMs.
  • Benchmarked five uncertainty quantification methods on MedGemma and LLaVA-RAD, showing the two failure modes share a common cause: proximity to the decision boundary.
  • Simple predictive entropy outperforms complex ensembles and MC Dropout for error detection and paraphrase screening.
  • Ensembles struggled with out-of-distribution data for MedGemma, but LLaVA-RAD's ensemble did not collapse.
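The core signal behind these contributions is predictive (Shannon) entropy computed from the class probabilities of one forward pass. A minimal sketch of the idea — all names, the probabilities, and the threshold value below are hypothetical illustrations, not the paper's implementation:

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a predictive distribution.

    probs: 1-D array of class probabilities from a single forward pass.
    """
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()  # renormalize defensively
    return float(-(p * np.log(p + 1e-12)).sum())

# A confident prediction has low entropy; an uncertain one has high entropy.
confident = predictive_entropy([0.95, 0.03, 0.02])
uncertain = predictive_entropy([0.40, 0.35, 0.25])

# One threshold on entropy (tuned on validation data in practice) can flag
# a prediction as both unreliable and likely to flip under rephrasing.
THRESHOLD = 0.5  # hypothetical value for illustration
needs_review = uncertain > THRESHOLD
```

Because this needs only the probabilities the model already produces, it costs nothing extra at inference time, unlike ensembles or MC Dropout.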

Why it matters

This paper highlights a simple, efficient method (predictive entropy) for improving the reliability and robustness of medical Vision-Language Models. By tracing both miscalibration and paraphrase sensitivity to a common cause, it offers a practical route to safer deployment, which is crucial for trustworthy AI in healthcare.

Original Abstract

Medical Vision-Language Models (VLMs) suffer from two failure modes that threaten safe deployment: miscalibrated confidence and sensitivity to question rephrasing. We show they share a common cause, proximity to the decision boundary, by benchmarking five uncertainty quantification methods on MedGemma-4B-IT across in-distribution (MIMIC-CXR) and out-of-distribution (PadChest) chest X-ray datasets, with cross-architecture validation on LLaVA-RAD-7B. For well-calibrated single-model methods, predictive entropy from one forward pass predicts which samples will flip under rephrasing (AUROC 0.711 on MedGemma, 0.878 on LLaVA-RAD; p < 10⁻⁴), enabling a single entropy threshold to flag both unreliable and rephrase-sensitive predictions. A five-member LoRA ensemble fails under the MIMIC→PadChest shift (42.9% ECE, 34.1% accuracy), though LLaVA-RAD's ensemble does not collapse (69.1%). MC Dropout achieves the best calibration (ECE 4.3%) and selective prediction coverage (21.5% at 5% risk), yet total entropy from a single forward pass outperforms the ensemble for both error detection (AUROC 0.743 vs 0.657) and paraphrase screening. Simple methods win.
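The expected calibration error (ECE) figures quoted in the abstract measure the gap between a model's stated confidence and its actual accuracy. A minimal binned-ECE sketch with toy data (the data and bin count are illustrative assumptions, not the paper's setup):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: bin-size-weighted mean of |accuracy - confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()    # empirical accuracy in this bin
            conf = confidences[mask].mean()  # mean stated confidence
            ece += mask.mean() * abs(acc - conf)
    return ece

# Well calibrated: 85% confidence, 85% actually correct -> ECE near 0.
good = expected_calibration_error([0.85] * 20, [1] * 17 + [0] * 3)

# Overconfident: 95% confidence but only 60% correct -> ECE near 0.35.
bad = expected_calibration_error([0.95] * 10, [1] * 6 + [0] * 4)
```

An ensemble with 42.9% ECE, as reported for MedGemma under the MIMIC→PadChest shift, is therefore drastically overconfident relative to its actual accuracy.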
