ArXiv TLDR

Representation geometry shapes task performance in vision-language modeling for CT enterography

arXiv:2604.13021

Cristian Minoccheri, Emily Wittrup, Kayvan Najarian, Ryan Stidham

cs.CV cs.AI

TLDR

This paper presents the first study of vision-language transfer learning on abdominal CT enterography, identifying which pooling and image-encoding choices best support IBD severity assessment, cross-modal retrieval, and report generation.

Key contributions

  • Mean pooling excels for categorical disease assessment (59.2% three-class accuracy), while attention pooling is better for cross-modal retrieval (0.235 text-to-image MRR); a minimal sketch of both aggregators follows this list.
  • Multi-window RGB encoding, which maps complementary Hounsfield Unit windows to a slice's RGB channels, outperforms multiplanar sampling for CT enterography classification; adding coronal and sagittal views actually reduces accuracy (windowing sketch below).
  • Retrieval-augmented generation (RAG) lifts within-1 severity accuracy 7–14 percentage points above the prevalence-matched chance baseline and improves ordinal MAE from 0.98 to 0.80–0.89 (RAG sketch below).
  • A three-teacher pseudolabel framework enables all comparisons without expert annotations (consensus sketch below).
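
A minimal PyTorch sketch of the two slice aggregators compared in the first finding; the shapes, the single-linear attention scorer, and all hyperparameters are illustrative assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Learned attention pooling: weight each slice before summing (illustrative)."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one relevance logit per slice

    def forward(self, slices: torch.Tensor) -> torch.Tensor:
        # slices: (num_slices, dim) embeddings from a 2D image encoder
        weights = torch.softmax(self.score(slices), dim=0)  # (num_slices, 1)
        return (weights * slices).sum(dim=0)                # (dim,)

def mean_pool(slices: torch.Tensor) -> torch.Tensor:
    """Uniform average over slices; the paper finds this better for severity classes."""
    return slices.mean(dim=0)

slices = torch.randn(96, 512)                    # e.g. 96 axial slices, 512-d embeddings
severity_features = mean_pool(slices)            # stronger for 3-class assessment
retrieval_features = AttentionPool(512)(slices)  # stronger for text-to-image retrieval
```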
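The multi-window RGB idea can be sketched as follows; the three (center, width) window pairs are common abdominal-CT choices picked for illustration, since the summary does not specify the paper's exact windows:

```python
import numpy as np

def hu_window(slice_hu: np.ndarray, center: float, width: float) -> np.ndarray:
    """Clip a HU slice to [center - width/2, center + width/2] and scale to [0, 1]."""
    lo, hi = center - width / 2, center + width / 2
    return (np.clip(slice_hu, lo, hi) - lo) / (hi - lo)

def multi_window_rgb(slice_hu: np.ndarray) -> np.ndarray:
    # Map three complementary HU windows of the same slice to R, G, B channels,
    # giving the 2D encoder richer per-slice tissue contrast (window values assumed).
    r = hu_window(slice_hu, center=40, width=400)    # general soft tissue
    g = hu_window(slice_hu, center=50, width=150)    # narrow soft-tissue detail
    b = hu_window(slice_hu, center=400, width=1800)  # bone / enteric contrast
    return np.stack([r, g, b], axis=-1)              # (H, W, 3) pseudo-RGB image

slice_hu = np.random.randint(-1000, 1500, size=(512, 512)).astype(np.float32)
rgb = multi_window_rgb(slice_hu)  # feed to a pretrained RGB vision encoder
```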
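A hedged sketch of the RAG step for report generation, assuming a cosine-similarity retriever over embeddings of prior reports; the function names and prompt template are hypothetical, not the paper's pipeline:

```python
import numpy as np

def retrieve_reports(query_emb, report_embs, reports, k=3):
    """Return the k prior reports whose embeddings are most cosine-similar to the query."""
    sims = report_embs @ query_emb / (
        np.linalg.norm(report_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    top = np.argsort(-sims)[:k]
    return [reports[i] for i in top]

def build_prompt(image_summary: str, retrieved: list[str]) -> str:
    # Prepend retrieved reports as context so the generator sees realistic
    # severity phrasing instead of relying only on the class distribution.
    context = "\n---\n".join(retrieved)
    return (
        "Similar prior CT enterography reports:\n"
        f"{context}\n\n"
        "Write a report for the current study, including an IBD severity grade.\n"
        f"Findings summary: {image_summary}"
    )

reports = ["Mild ileal wall thickening.", "Severe stricturing disease.", "Normal study."]
context = retrieve_reports(np.random.randn(512), np.random.randn(3, 512), reports, k=2)
print(build_prompt("segmental mural hyperenhancement in the terminal ileum", context))
```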
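The consensus rule below is one plausible reading of a three-teacher pseudolabel scheme (simple majority vote, discarding full disagreement); the paper's actual aggregation rule may differ:

```python
from collections import Counter

def pseudolabel(teacher_labels: list[str]) -> str | None:
    """Keep the majority label from three teachers; drop the study if all disagree."""
    label, count = Counter(teacher_labels).most_common(1)[0]
    return label if count >= 2 else None

print(pseudolabel(["moderate", "moderate", "severe"]))  # -> "moderate"
print(pseudolabel(["mild", "moderate", "severe"]))      # -> None (study discarded)
```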

Why it matters

This paper establishes the first vision-language baselines for CT enterography, a primary imaging modality for IBD. It offers practical guidance on representation, pooling, and retrieval-augmentation choices for building automated IBD assessment and report-generation systems from volumetric medical imaging.

Original Abstract

Computed tomography (CT) enterography is a primary imaging modality for assessing inflammatory bowel disease (IBD), yet the representational choices that best support automated analysis of this modality are unknown. We present the first study of vision-language transfer learning on abdominal CT enterography and identify two main findings. First, mean pooling of slice embeddings gives better categorical disease assessment (59.2% three-class accuracy), whereas attention pooling gives better cross-modal retrieval (0.235 text-to-image MRR). This pattern holds across all LoRA configurations tested and suggests that the two aggregators emphasize different properties of the learned representation. Second, per-slice tissue contrast matters more than broader spatial coverage: multi-window RGB encoding, which maps complementary Hounsfield Unit windows to RGB channels, outperforms all strategies that increase spatial coverage through multiplanar sampling, and in this setting adding coronal and sagittal views reduces classification performance. For report generation, fine-tuning without retrieval context yields within-1 severity accuracy at the prevalence-matched chance level (70.4% vs. 71% random), suggesting little learned ordering beyond the class distribution. Retrieval-augmented generation (RAG) improves this across all configurations, scoring 7–14 percentage points above the chance baseline and improving ordinal MAE from 0.98 to 0.80–0.89. A three-teacher pseudolabel framework enables all comparisons without expert annotations. Together, these findings provide the first baselines for this underexplored modality and offer practical guidance for building vision-language systems for volumetric medical imaging.
