ArXiv TLDR

PubMed-Ophtha: An open resource for training ophthalmology vision-language models on scientific literature

arXiv:2605.02720

Verena Jasmin Hallitschke, Carsten Eickhoff, Philipp Berens

cs.CV, cs.CL

TLDR

PubMed-Ophtha is a new dataset of 102K image-caption pairs, extracted from open-access scientific literature, for training ophthalmology vision-language models.

Key contributions

  • 102,023 ophthalmological image-caption pairs from 15,842 open-access PubMed Central articles.
  • Figures extracted directly from article PDFs at full resolution and decomposed into their constituent panels, panel identifiers, and individual images (see the record sketch after this list).
  • Images annotated with imaging modality (e.g., OCT, fundus photography) and mark status (presence of annotation marks such as arrows).
  • LLM-based approach splits captions into panel-level subcaptions with high accuracy (BLEU 0.913).
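
To make the hierarchy concrete, here is a minimal sketch of what a single record could look like. The field names and values are hypothetical illustrations based on the paper's description, not the dataset's actual schema.

```python
# Hypothetical sketch of one PubMed-Ophtha record, illustrating the
# article -> figure -> panel -> image hierarchy described above.
# Field names are illustrative assumptions, not the released schema.
record = {
    "pmc_id": "PMC0000000",  # open-access PubMed Central article
    "figure": {
        "caption": "A) Fundus photograph. B) OCT scan of the macula.",
        "panels": [
            {
                "identifier": "A",                   # detected panel identifier
                "subcaption": "Fundus photograph.",  # LLM-split subcaption
                "images": [
                    {
                        "modality": "color fundus photography",
                        "has_marks": False,  # annotation marks such as arrows
                    }
                ],
            },
            {
                "identifier": "B",
                "subcaption": "OCT scan of the macula.",
                "images": [
                    {
                        "modality": "optical coherence tomography",
                        "has_marks": True,
                    }
                ],
            },
        ],
    },
}
```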

Why it matters

This paper addresses the critical scarcity of high-quality image-text datasets for ophthalmology vision-language models. By providing a large, carefully curated, and hierarchically structured dataset, together with the human-annotated ground-truth data, trained models, and full generation pipeline, PubMed-Ophtha can accelerate reproducible research on medical vision-language models.

Original Abstract

Vision-language models hold considerable promise for ophthalmology, but their development depends on large-scale, high-quality image-text datasets that remain scarce. We present PubMed-Ophtha, a hierarchical dataset of 102,023 ophthalmological image-caption pairs extracted from 15,842 open-access articles in PubMed Central. Unlike existing datasets, figures are extracted directly from article PDFs at full resolution and decomposed into their constituent panels, panel identifiers, and individual images. Each image is annotated with its imaging modality -- color fundus photography, optical coherence tomography, retinal imaging, or other -- and a mark status indicating the presence of annotation marks such as arrows. Figure captions are split into panel-level subcaptions using a two-step LLM approach, achieving a mean average sentence BLEU score of 0.913 on human-annotated data. Panel and image detection models reach a mAP@0.50 of 0.909 and 0.892, respectively, and figure extraction achieves a median IoU of 0.997. To support reproducibility, we additionally release the human-annotated ground-truth data, all trained models, and the full dataset generation pipeline.
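
The caption-splitting quality above is reported as a mean average sentence BLEU of 0.913 against human-annotated subcaptions. As a rough illustration of that style of evaluation (not the authors' exact protocol), each predicted subcaption can be scored against its ground-truth counterpart and the scores averaged, e.g. with NLTK:

```python
# Rough sketch of a mean sentence-level BLEU evaluation for caption
# splitting, assuming aligned predicted/ground-truth subcaption lists.
# This illustrates the metric only; it is not the paper's exact protocol.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def mean_sentence_bleu(predicted: list[str], reference: list[str]) -> float:
    """Average sentence BLEU over aligned subcaption pairs."""
    smooth = SmoothingFunction().method1  # avoid zero scores on short texts
    scores = [
        sentence_bleu([ref.split()], pred.split(), smoothing_function=smooth)
        for pred, ref in zip(predicted, reference)
    ]
    return sum(scores) / len(scores)

# Toy usage with hypothetical subcaptions:
preds = ["fundus photograph of the right eye", "oct scan with arrows"]
refs = ["fundus photograph of the right eye", "oct scan with arrow marks"]
print(f"mean sentence BLEU: {mean_sentence_bleu(preds, refs):.3f}")
```

Sentence-level BLEU on short texts benefits from smoothing, which is why the sketch applies `SmoothingFunction().method1`.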
