ArXiv TLDR

SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos

arXiv: 2604.09037

Xiyang Huang, Jiawei Lin, Keying Wu, Jiaxin Huang, Kailai Yang + 7 more

cs.CV · cs.CL · cs.HC

TLDR

SiMing-Bench evaluates MLLMs' ability to judge procedural correctness from continuous interactions in full-length clinical skill videos, revealing that current models agree only weakly with physician judgments, especially on intermediate steps.

Key contributions

  • Introduces SiMing-Bench, a benchmark for evaluating MLLMs' judgment of procedural correctness.
  • Focuses on continuous interaction-driven state updates in full-length clinical skill videos.
  • Instantiates the benchmark with SiMing-Score, a physician-annotated dataset of real clinical skill examination videos with standardized step-wise rubrics and dual-expert labels.
  • Shows that MLLMs agree only weakly with physician judgments, especially on rubric-defined intermediate steps.

Why it matters

This paper addresses a critical gap in MLLM evaluation: procedural judgment from continuous interactions. It shows that current MLLMs struggle to track interaction-driven state updates, even when overall procedure-level performance appears acceptable, highlighting a crucial area for improvement before MLLMs can be applied to real-world procedural assessment.

Original Abstract

Current video benchmarks for multimodal large language models (MLLMs) focus on event recognition, temporal ordering, and long-context recall, but overlook a harder capability required for expert procedural judgment: tracking how ongoing interactions update the procedural state and thereby determine the correctness of later actions. We introduce SiMing-Bench, the first benchmark for evaluating this capability from full-length clinical skill videos. It targets rubric-grounded process-level judgment of whether interaction-driven state updates preserve procedural correctness across an entire workflow. SiMing-Bench is instantiated with SiMing-Score, a physician-annotated dataset of real clinical skill examination videos spanning cardiopulmonary resuscitation, automated external defibrillator operation, and bag-mask ventilation, each paired with a standardized step-wise rubric and dual-expert labels. Across diverse open- and closed-source MLLMs, we observe consistently weak agreement with physician judgments. Moreover, weak performance on rubric-defined intermediate steps persists even when overall procedure-level correlation appears acceptable, suggesting that coarse global assessment substantially overestimates current models' procedural judgment ability. Additional analyses with binary step judgment and step-aligned clips indicate that the bottleneck is not merely fine-grained scoring or temporal localization, but modeling how continuous interactions update procedural state over time.
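To make the reported gap concrete, below is a minimal sketch of the two-level comparison the abstract describes: correlating procedure-level rubric totals with physician scores versus checking agreement on individual rubric steps. The metric choices (Spearman correlation, exact-match step agreement) and the toy scores are illustrative assumptions, not the paper's actual evaluation protocol or data.

```python
# Minimal sketch: procedure-level correlation vs. per-step agreement.
# All numbers below are hypothetical; SiMing-Bench's real rubrics and
# metrics may differ.

from scipy.stats import spearmanr

# Hypothetical rubric judgments: each inner list holds per-step labels
# (0 = incorrect, 1 = correct) for one examinee's full procedure.
physician_steps = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
]
model_steps = [
    [1, 0, 1, 1, 1],  # same total as the physician, but wrong steps
    [0, 1, 0, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 1, 0],
]

# Procedure-level view: correlate total rubric scores per video.
phys_totals = [sum(s) for s in physician_steps]
model_totals = [sum(s) for s in model_steps]
rho, _ = spearmanr(phys_totals, model_totals)

# Step-level view: fraction of individual step judgments that match.
pairs = [(p, m) for ps, ms in zip(physician_steps, model_steps)
         for p, m in zip(ps, ms)]
step_acc = sum(p == m for p, m in pairs) / len(pairs)

print(f"procedure-level Spearman rho: {rho:.2f}")
print(f"step-level agreement:         {step_acc:.2f}")
```

On this toy data the procedure-level correlation comes out high (about 0.94) while fewer than half of the individual step judgments match (0.45), which is exactly the kind of divergence the authors argue coarse global assessment hides.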
