ArXiv TLDR

MedProbeBench: Systematic Benchmarking of Deep Evidence Integration for Expert-level Medical Guideline Generation

arXiv: 2604.18418

Jiyao Liu, Jianghan Shen, Sida Song, Tianbin Li, Xiaojia Liu + 17 more

cs.CV

TLDR

MedProbeBench is a new benchmark that evaluates LLMs' deep evidence integration for generating expert-level medical guidelines, revealing critical gaps between current systems and expert-level guideline development.

Key contributions

  • Introduces MedProbeBench, the first benchmark to use high-quality clinical guidelines as expert-level references.
  • Proposes MedProbe-Eval, an evaluation framework with 1,200+ task-adaptive rubric criteria for holistic quality assessment.
  • Features fine-grained evidence verification grounded in over 5,130 atomic claims.
  • Evaluates 17 LLMs and deep research agents, highlighting critical gaps in expert-level guideline generation.

Why it matters

Current benchmarks fail to assess LLMs' ability to integrate deep evidence for complex medical tasks. MedProbeBench fills this gap by providing a rigorous evaluation framework based on expert-level clinical guidelines. The results highlight critical areas for improvement in AI systems aimed at medical applications.

Original Abstract

Recent advances in deep research systems enable large language models to retrieve, synthesize, and reason over large-scale external knowledge. In medicine, developing clinical guidelines critically depends on such deep evidence integration. However, existing benchmarks fail to evaluate this capability in realistic workflows requiring multi-step evidence integration and expert-level judgment. To address this gap, we introduce MedProbeBench, the first benchmark leveraging high-quality clinical guidelines as expert-level references. Medical guidelines, with their rigorous standards in neutrality and verifiability, represent the pinnacle of medical expertise and pose substantial challenges for deep research agents. For evaluation, we propose MedProbe-Eval, a comprehensive evaluation framework featuring: (1) Holistic Rubrics with 1,200+ task-adaptive rubric criteria for comprehensive quality assessment, and (2) Fine-grained Evidence Verification for rigorous validation of evidence precision, grounded in 5,130+ atomic claims. Evaluation of 17 LLMs and deep research agents reveals critical gaps in evidence integration and guideline generation, underscoring the substantial distance between current capabilities and expert-level clinical guideline development. Project: https://github.com/uni-medical/MedProbeBench
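The evaluation framework pairs holistic rubric scoring with claim-level evidence verification. Below is a minimal, hypothetical Python sketch of how such a two-part evaluation could be wired up; the data structures, the `judge`/`entails` callables, and the scoring formulas are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RubricCriterion:
    """One task-adaptive rubric criterion (hypothetical structure)."""
    description: str
    weight: float = 1.0

@dataclass
class AtomicClaim:
    """One atomic claim distilled from an expert guideline (hypothetical)."""
    text: str

def rubric_score(guideline: str,
                 rubrics: list[RubricCriterion],
                 judge: Callable[[str, str], bool]) -> float:
    """Holistic quality: weighted fraction of rubric criteria satisfied.

    `judge(criterion, guideline)` stands in for an LLM-as-judge call; its
    interface and prompt are assumptions, not the paper's implementation.
    """
    total = sum(r.weight for r in rubrics)
    if total == 0:
        return 0.0
    earned = sum(r.weight for r in rubrics if judge(r.description, guideline))
    return earned / total

def evidence_precision(cited_claims: list[str],
                       reference: list[AtomicClaim],
                       entails: Callable[[str, str], bool]) -> float:
    """Evidence precision: fraction of the model's cited claims supported
    by (entailed from) at least one reference atomic claim.

    `entails(premise, hypothesis)` stands in for an NLI model or judge call.
    """
    if not cited_claims:
        return 0.0
    supported = sum(
        1 for claim in cited_claims
        if any(entails(ref.text, claim) for ref in reference)
    )
    return supported / len(cited_claims)
```

Keeping the two scores separate mirrors the two components described in the abstract: one number for overall guideline quality against task-adaptive rubrics, and one for how precisely the cited evidence is grounded in reference atomic claims.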
