Good Rankings, Wrong Probabilities: A Calibration Audit of Multimodal Cancer Survival Models

April 5, 2026 · arXiv:2604.04239

cs.LG, cs.AI, q-bio.QM

TLDR

Multimodal cancer survival models, despite good ranking performance, often produce miscalibrated survival probabilities, highlighting the need for calibration audits.

Key contributions

Conducted what the authors believe is the first systematic fold-level 1-calibration audit of multimodal WSI-genomics cancer survival models.
Found that most fold-level tests reject correct calibration (166 of 290 after Benjamini-Hochberg correction), including for high C-index models, across cancer types and architectures.
Found that gating-based fusion is associated with better calibration, and that post-hoc Platt scaling reduces miscalibration at the evaluated horizon without affecting discrimination.
Concluded that the concordance index alone is insufficient for evaluating survival models intended for clinical use.
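
For readers unfamiliar with 1-calibration, the following is a minimal sketch of one common formulation: a Hosmer-Lemeshow-style test at a single horizon, with the observed event rate in each bin estimated by Kaplan-Meier to account for censoring. It is not the paper's code. The function names, the synthetic data, the bin count, and the degrees-of-freedom convention (bins minus 2) are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import chi2

def km_survival_at(time, event, t_star):
    """Kaplan-Meier estimate of S(t_star) from right-censored data."""
    surv = 1.0
    for t in np.unique(time[(event == 1) & (time <= t_star)]):
        at_risk = np.sum(time >= t)
        deaths = np.sum((time == t) & (event == 1))
        surv *= 1.0 - deaths / at_risk
    return surv

def one_calibration(pred_event_prob, time, event, t_star, n_bins=10):
    """Hosmer-Lemeshow-style statistic and p-value at a single horizon."""
    order = np.argsort(pred_event_prob)
    stat = 0.0
    for idx in np.array_split(order, n_bins):
        n_g = len(idx)
        expected = pred_event_prob[idx].mean()  # mean predicted P(T <= t_star)
        observed = 1.0 - km_survival_at(time[idx], event[idx], t_star)
        stat += n_g * (observed - expected) ** 2 / (expected * (1.0 - expected))
    # Degrees of freedom: n_bins - 2 is an assumed convention, not from the paper.
    return stat, chi2.sf(stat, df=n_bins - 2)

# Synthetic cohort (hypothetical): exponential event times, independent censoring.
rng = np.random.default_rng(0)
n, t_star = 2000, 1.0
lam = rng.uniform(0.2, 2.0, n)
event_time = rng.exponential(1.0 / lam)
censor_time = rng.exponential(3.0, n)
time = np.minimum(event_time, censor_time)
event = (event_time <= censor_time).astype(int)

p_true = 1.0 - np.exp(-lam * t_star)  # correct event probabilities at t_star
p_bad = p_true ** 3                   # same ranking, distorted probabilities

stat_good, p_good = one_calibration(p_true, time, event, t_star)
stat_bad, p_bad_value = one_calibration(p_bad, time, event, t_star)
print(f"correct probabilities:   stat={stat_good:.1f}, p={p_good:.3f}")
print(f"distorted probabilities: stat={stat_bad:.1f}, p={p_bad_value:.3g}")
```

A small p-value rejects the null hypothesis of correct calibration at the chosen horizon. The distorted predictions preserve the patient ordering but should be rejected decisively.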

Why it matters

This paper shows that current multimodal cancer survival models can rank patients well yet assign unreliable survival probabilities. Clinical use requires calibration as well as discrimination, so the concordance index cannot serve as the only evaluation metric. The audit also points to two mitigations: gating-based fusion, which is associated with better calibration, and post-hoc Platt scaling, which reduces miscalibration at the evaluated horizon.
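
The gap between ranking and probability quality can be illustrated with a small synthetic sketch. The concordance index depends only on the ordering of predictions, so a monotone distortion leaves it unchanged while shifting the predicted probabilities away from the observed event rate. The data, the distortion (cubing the probabilities), and the `c_index` helper are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def c_index(time, event, risk):
    """Harrell's C: share of comparable pairs ordered correctly by risk."""
    concordant, comparable = 0.0, 0
    n = len(time)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i has an observed event before time[j].
            if event[i] and time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

# Synthetic, fully observed cohort (hypothetical): exponential event times.
rng = np.random.default_rng(0)
n, t_star = 300, 1.0
lam = rng.uniform(0.2, 2.0, n)
time = rng.exponential(1.0 / lam)
event = np.ones(n, dtype=bool)

p_cal = 1.0 - np.exp(-lam * t_star)  # correct P(event by t_star)
p_dist = p_cal ** 3                  # monotone distortion, same ordering

c_cal = c_index(time, event, p_cal)
c_dist = c_index(time, event, p_dist)
observed = np.mean(time <= t_star)   # observed event rate at t_star

print(f"C-index: calibrated={c_cal:.3f}, distorted={c_dist:.3f}")
print(f"mean predicted: calibrated={p_cal.mean():.3f}, distorted={p_dist.mean():.3f}")
print(f"observed event rate: {observed:.3f}")
```

Both prediction sets yield the same C-index, but only the first has a mean predicted probability close to the observed event rate.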

Original Abstract

Multimodal deep learning models that fuse whole-slide histopathology images with genomic data have achieved strong discriminative performance for cancer survival prediction, as measured by the concordance index. Yet whether the survival probabilities derived from these models - either directly from native outputs or via standard post-hoc reconstruction - are calibrated remains largely unexamined. We conduct, to our knowledge, the first systematic fold-level 1-calibration audit of multimodal WSI-genomics survival architectures, evaluating native discrete-time survival outputs (Experiment A: 3 models on TCGA-BRCA) and Breslow-reconstructed survival curves from scalar risk scores (Experiment B: 11 architectures across 5 TCGA cancer types). In Experiment A, all three models fail 1-calibration on a majority of folds (12 of 15 fold-level tests reject after Benjamini-Hochberg correction). Across the full 290 fold-level tests, 166 reject the null of correct calibration at the median event time after Benjamini-Hochberg correction (FDR = 0.05). MCAT achieves C-index 0.817 on GBMLGG yet fails 1-calibration on all five folds. Gating-based fusion is associated with better calibration; bilinear and concatenation fusion are not. Post-hoc Platt scaling reduces miscalibration at the evaluated horizon (e.g., MCAT: 5/5 folds failing to 2/5) without affecting discrimination. The concordance index alone is insufficient for evaluating survival models intended for clinical use.
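
The abstract applies Benjamini-Hochberg correction at FDR = 0.05 to the fold-level tests. As a rough sketch of that step-up procedure, assuming only a list of p-values (the values below are hypothetical, not results from the paper):

```python
import numpy as np

def benjamini_hochberg(p_values, fdr=0.05):
    """Return a boolean array marking the hypotheses rejected at the given FDR."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The p-value of rank k (1-based) is compared against fdr * k / m.
    thresholds = fdr * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank meeting its threshold
        reject[order[:k + 1]] = True    # reject everything up to that rank
    return reject

# Hypothetical p-values from five fold-level calibration tests.
p_vals = [0.001, 0.012, 0.029, 0.041, 0.20]
rejected = benjamini_hochberg(p_vals)
print(rejected.tolist())
```

With these illustrative inputs, the three smallest p-values fall under their rank thresholds (0.01, 0.02, 0.03), so the first three tests are rejected and the last two are not.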


