ArXiv TLDR

MedHorizon: Towards Long-context Medical Video Understanding in the Wild

2605.06537

Bodong Du, Bowen Liu, Yang Yu, Xinpeng Ding, Zhiheng Wu + 6 more

cs.CV

TLDR

MedHorizon introduces a new benchmark for long-context medical video understanding, revealing that current MLLMs struggle with sparse evidence retrieval and clinical reasoning.

Key contributions

  • Introduces MedHorizon, a new benchmark for long-context medical video understanding in the wild.
  • Features 759 hours of full clinical procedures and 1,253 evidence-grounded multiple-choice questions.
  • Highlights extreme evidence sparsity (only 0.166% of frames contain evidence on average), forcing models to retrieve before reasoning.
  • Reveals that the best current MLLM reaches only 41.1% accuracy, with failures in evidence retrieval, clinical interpretation, and attention drift.
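The sparsity figure above is simply the fraction of frames that carry decisive evidence. A minimal sketch of that ratio, using hypothetical frame counts (not numbers from the paper) chosen to land near the reported 0.166% average:

```python
def evidence_sparsity(evidence_frames: int, total_frames: int) -> float:
    """Percentage of frames in a procedure video that carry decisive evidence."""
    return 100.0 * evidence_frames / total_frames

# Hypothetical example: a 1-hour procedure at 30 fps (108,000 frames)
# with ~179 evidence frames gives roughly the 0.166% average sparsity
# reported by the benchmark.
print(round(evidence_sparsity(179, 108_000), 3))  # → 0.166
```

At that density, a model sampling a few hundred frames uniformly would likely see no evidence frames at all, which is why the benchmark frames retrieval as a prerequisite to reasoning.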

Why it matters

This paper addresses a critical gap in medical AI by providing a realistic benchmark for full-procedure video understanding. It exposes significant limitations in current MLLMs regarding sparse evidence retrieval and complex clinical reasoning. MedHorizon will drive the development of more robust and clinically applicable AI systems for long medical videos.

Original Abstract

Medical multimodal large language models (MLLMs) have advanced image understanding and short-video analysis, but real clinical review often requires full-procedure video understanding. Unlike general long videos, medical procedures contain highly redundant anatomical views, while decisive evidence is temporally sparse, spatially subtle, and context dependent. Existing benchmarks often assume this evidence has already been localized through images, short clips, or pre-segmented videos, leaving the retrieval-before-reasoning problem under-tested. We introduce MedHorizon, an in-the-wild benchmark for long-context medical video understanding. MedHorizon preserves 759 hours of full-length clinical procedures and provides 1,253 evidence-grounded multiple-choice questions that jointly evaluate sparse evidence understanding and multi-hop clinical reasoning. Its evidence is extremely sparse, with only 0.166% evidence frames on average, requiring models to search noisy procedural streams before interpreting and aggregating findings. We evaluate representative general-domain, medical-domain, and long-video MLLMs. The best model reaches only 41.1% accuracy, showing that current systems remain far from robust full-procedure understanding. Further analysis yields four key findings: performance does not scale reliably with more frames; evidence retrieval and clinical interpretation remain the primary bottlenecks; these bottlenecks are rooted in weak procedural reasoning and attention drift under redundancy; and generic sampling methods only partially balance local detail with global coverage. MedHorizon provides a rigorous testbed for MLLMs that retrieve sparse evidence and reason over complete clinical workflows.
