Robustness Evaluation of a Foundation Segmentation Model Under Simulated Domain Shifts in Abdominal CT: Implications for Health Digital Twin Deployment
TLDR
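SAM (ViT-B) keeps spleen segmentation accuracy stable in abdominal CT under moderate simulated domain shifts: mean Dice changes by less than 0.01 from a clean baseline of 0.9145, supporting its use as a robust baseline for medical image segmentation research.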
Key contributions
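- Audited the slice-level robustness of SAM (ViT-B) for spleen segmentation on 1,051 nonempty slices from 41 abdominal CT volumes in the Medical Segmentation Decathlon.
- Used ground-truth-derived bounding-box prompts to isolate encoder robustness from prompt uncertainty.
- Applied ten controlled perturbation conditions simulating inter-scanner variability: Gaussian noise, blur, contrast scaling, gamma correction, and resolution mismatch.
- Measured an absolute mean ΔDice below 0.01 under every perturbation, against a clean baseline Dice of 0.9145 (95% CI: [0.909, 0.919]); some changes were statistically significant but small in magnitude.
- Found no significant increase in failure probability under McNemar analysis (clean-baseline failure rate: 0.67%).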
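The measurement pieces behind these contributions can be illustrated with a minimal NumPy sketch. It is not the authors' code. It assumes each CT slice has already been windowed and rescaled to [0, 1]. The function names, the perturbation parameters, and the convention of scoring two empty masks as 1.0 are hypothetical choices made for illustration. Blur and resolution mismatch are omitted to keep the sketch dependency-free.

```python
import numpy as np

def dice_score(pred, target):
    """Dice overlap between two binary masks (1.0 means perfect agreement)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # both masks empty: scored as agreement (assumed convention)
    return float(2.0 * np.logical_and(pred, target).sum() / denom)

def bbox_from_mask(mask):
    """Tight box (x_min, y_min, x_max, y_max) around a nonempty binary mask.

    Mimics a ground-truth-derived box prompt; any padding is left out.
    """
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def add_gaussian_noise(img, sigma, rng):
    """Additive zero-mean Gaussian noise, clipped back to [0, 1]."""
    return np.clip(img + rng.normal(0.0, sigma, size=img.shape), 0.0, 1.0)

def scale_contrast(img, factor):
    """Stretch or compress intensities about the image mean."""
    mean = img.mean()
    return np.clip(mean + factor * (img - mean), 0.0, 1.0)

def gamma_correct(img, gamma):
    """Power-law intensity remapping on a [0, 1] image."""
    return np.clip(img, 0.0, 1.0) ** gamma
```

In an audit of this kind, each slice would be segmented once on the clean image and once per perturbed copy with the same box prompt, and ΔDice is the per-slice difference between the two scores.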
Why it matters
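Foundation segmentation models are increasingly incorporated into health digital twins for anatomical modeling and organ-level monitoring, so their robustness under realistic imaging variability needs to be quantified before deployment can be trusted. This study shows that SAM remains stable under moderate simulated CT domain shifts when given ground-truth-derived bounding-box prompts, which supports its use as a robust baseline for medical image segmentation research.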
Original Abstract
Foundation segmentation models such as the Segment Anything Model (SAM) have demonstrated strong generalization across natural images; however, their robustness under clinically realistic medical imaging domain shifts remains insufficiently quantified. We present a systematic slice-level robustness audit of SAM (ViT-B) for spleen segmentation in abdominal CT using 1,051 nonempty slices from 41 volumes in the Medical Segmentation Decathlon. A standardized ground-truth-derived bounding-box protocol was used to isolate encoder robustness from prompt uncertainty. Controlled perturbations simulating inter-scanner variability, including Gaussian noise, blur, contrast scaling, gamma correction, and resolution mismatch, were applied across ten conditions. The clean baseline achieved a mean Dice score of 0.9145 (95% CI: [0.909, 0.919]) with a failure rate of 0.67%. Across all perturbations, the absolute mean ΔDice remained below 0.01. Paired Wilcoxon signed-rank tests with Benjamini-Hochberg false discovery rate correction identified statistically significant but small-magnitude changes under selected conditions, while McNemar analysis showed no significant increase in failure probability. These findings indicate that SAM exhibits stable segmentation behavior under moderate CT domain shifts, supporting its role as a robust foundation baseline for medical image segmentation research. As health digital twins increasingly incorporate foundation segmentation models for anatomical modeling and organ-level monitoring, formal characterization of robustness under real-world imaging variability is a necessary step toward trustworthy deployment.
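The statistical steps named in the abstract can be sketched in plain Python. This is an illustration under assumptions, not the paper's implementation. The per-condition p-values are assumed to come from a paired test such as `scipy.stats.wilcoxon` applied to per-slice Dice scores. The abstract does not say which McNemar variant was used, so the exact binomial form is shown here. The function names are hypothetical.

```python
from math import comb

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: slices that fail only on the clean image
    c: slices that fail only on the perturbed image
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a change
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

For example, `mcnemar_exact(1, 5)` gives 14/64 = 0.21875: with only six discordant slices, a 1-versus-5 split is not strong evidence that a perturbation raises the failure rate.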