ArXiv TLDR

Contrastive Learning of Medical Visual Representations from Paired Images and Text

arXiv: 2010.00747

Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D. Manning, Curtis P. Langlotz

cs.CV · cs.CL · cs.LG

TLDR

ConVIRT is an unsupervised contrastive learning method that leverages paired medical images and their descriptive text to learn superior visual representations, significantly improving data efficiency and performance on medical imaging tasks.

Key contributions

  • Introduces ConVIRT, a bidirectional contrastive learning framework using paired medical images and text without requiring expert annotations.
  • Demonstrates that ConVIRT pretrained models outperform ImageNet-based and other baselines on multiple medical image classification and zero-shot retrieval tasks.
  • Shows that ConVIRT achieves comparable or better results using only 10% of labeled data compared to ImageNet pretrained models, highlighting improved data efficiency.

Why it matters

This paper addresses the critical challenge of limited annotated medical image data by exploiting naturally paired text reports to learn rich visual representations without manual annotation. By bridging the gap between medical images and their textual descriptions through contrastive learning, ConVIRT enables more effective and efficient training of medical image models. This approach reduces reliance on costly expert annotations and domain-specific label extraction, potentially accelerating the development of robust medical imaging AI systems that generalize better across tasks and datasets.

Original Abstract

Learning visual representations of medical images (e.g., X-rays) is core to medical image understanding but its progress has been held back by the scarcity of human annotations. Existing work commonly relies on fine-tuning weights transferred from ImageNet pretraining, which is suboptimal due to drastically different image characteristics, or rule-based label extraction from the textual report data paired with medical images, which is inaccurate and hard to generalize. Meanwhile, several recent studies show exciting results from unsupervised contrastive learning from natural images, but we find these methods help little on medical images because of their high inter-class similarity. We propose ConVIRT, an alternative unsupervised strategy to learn medical visual representations by exploiting naturally occurring paired descriptive text. Our new method of pretraining medical image encoders with the paired text data via a bidirectional contrastive objective between the two modalities is domain-agnostic, and requires no additional expert input. We test ConVIRT by transferring our pretrained weights to 4 medical image classification tasks and 2 zero-shot retrieval tasks, and show that it leads to image representations that considerably outperform strong baselines in most settings. Notably, in all 4 classification tasks, our method requires only 10% as much labeled training data as an ImageNet initialized counterpart to achieve better or comparable performance, demonstrating superior data efficiency.
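The bidirectional contrastive objective the abstract describes can be sketched numerically: each image embedding is contrasted against all text embeddings in the batch (and vice versa), with the matched pair as the positive. Below is a minimal NumPy sketch of such an image↔text InfoNCE loss; the function name `convirt_loss` and the default temperature `tau` and direction weight `lam` are illustrative hyperparameters, not values taken from the paper.

```python
import numpy as np

def convirt_loss(v, u, tau=0.1, lam=0.75):
    """Bidirectional contrastive (InfoNCE-style) loss over paired
    image embeddings v and text embeddings u, each of shape (N, d),
    where row i of v is paired with row i of u.

    tau: softmax temperature; lam: weight on the image-to-text
    direction. Both are illustrative defaults, not the paper's.
    """
    # L2-normalize so the dot product is cosine similarity
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    u = u / np.linalg.norm(u, axis=1, keepdims=True)
    sim = (v @ u.T) / tau  # (N, N) similarity logits

    # Numerically stable log-softmax over a given axis
    def log_softmax(x, axis):
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    # Rows: image -> text; columns: text -> image.
    # The diagonal holds the matched (positive) pairs.
    loss_v2u = -np.diag(log_softmax(sim, axis=1)).mean()
    loss_u2v = -np.diag(log_softmax(sim, axis=0)).mean()
    return lam * loss_v2u + (1 - lam) * loss_u2v
```

Minimizing this loss pulls each image embedding toward its paired report's embedding while pushing it away from the other reports in the batch, which is what lets the image encoder learn from free-text supervision without any class labels.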
