ArXiv TLDR

MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge

2605.05175

Perry E. Radau

eess.IV · cs.CL · physics.med-ph

TLDR

MRI-Eval is a new tiered benchmark revealing that LLMs, despite high MCQ scores, struggle with free-text recall of MRI physics and GE scanner operations.

Key contributions

  • Created MRI-Eval, a 1365-item tiered benchmark for MRI physics and GE scanner operations knowledge.
  • Showed LLMs achieve high overall MCQ scores (93-97%), with GE scanner operations the lowest-scoring category for every model (88-94%).
  • Stem-only tests (answer options removed) revealed significant free-text recall weaknesses, especially for GE operations (13-29% accuracy).
  • Warns against using raw LLM outputs for vendor-specific protocol guidance due to recall gaps.

Why it matters

MRI-Eval exposes substantial LLM knowledge gaps in MRI physics and GE scanner operations, particularly in free-text recall. High MCQ scores can mask these weaknesses, so raw LLM outputs should be treated with caution when used for specialized, vendor-specific guidance.

Original Abstract

Background: Existing MRI LLM benchmarks rely mainly on review-book multiple-choice questions, where top proprietary models already score highly, limiting discrimination. No systematic benchmark has evaluated vendor-specific scanner operational knowledge central to research MRI practice.

Purpose: We developed MRI-Eval, a tiered benchmark for relative model comparison on MRI physics and GE scanner operations knowledge using primary multiple-choice questions (MCQ), with stem-only and primed diagnostic conditions as complementary analyses.

Methods: MRI-Eval includes 1365 scored items across nine categories and three difficulty tiers from textbooks, GE scanner manuals, programming course materials, and expert-generated questions. Five model families were evaluated (GPT-5.4, Claude Opus 4.6, Claude Sonnet 4.6, Gemini 2.5 Pro, Llama 3.3 70B). MCQ was primary; stem-only removed options and used an independent LLM judge; primed stem-only tested responses to incorrect user claims.

Results: Overall MCQ accuracy was 93.2% to 97.1%. GE scanner operations was the lowest category for every model (88.2% to 94.6%). In stem-only, frontier-model accuracy fell to 58.4% to 61.1%, and Llama 3.3 70B fell to 37.1%; GE scanner operations stem-only accuracy was 13.8% to 29.8%.

Conclusion: High MCQ performance can mask weak free-text recall, especially for vendor-specific operational knowledge. MRI-Eval is most informative as a relative comparison benchmark rather than an absolute competency measure and supports caution in using raw LLM outputs for GE-specific protocol guidance.
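The three evaluation conditions described in the Methods (MCQ, stem-only, and primed stem-only) can be sketched in code. This is a minimal illustrative sketch, not the authors' implementation: the `Item` fields, function names, and prompt wording are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Item:
    """Hypothetical benchmark item; field names are illustrative."""
    stem: str                 # the question text
    options: dict[str, str]   # lettered answer options, e.g. {"A": "..."}
    answer_key: str           # correct option letter
    category: str             # one of the nine categories
    tier: int                 # difficulty tier (1-3)

def mcq_prompt(item: Item) -> str:
    """Primary condition: stem plus lettered options (graded by exact letter match)."""
    opts = "\n".join(f"{k}. {v}" for k, v in sorted(item.options.items()))
    return f"{item.stem}\n{opts}\nAnswer with a single letter."

def stem_only_prompt(item: Item) -> str:
    """Diagnostic condition: options removed; the free-text answer would
    later be graded by an independent LLM judge against the key."""
    return f"{item.stem}\nAnswer in free text."

def primed_prompt(item: Item, incorrect_claim: str) -> str:
    """Primed stem-only condition: the user first asserts an incorrect claim,
    testing whether the model pushes back or defers."""
    return f"I believe {incorrect_claim}. {item.stem}\nAnswer in free text."
```

Keeping the three conditions as pure prompt-construction functions over the same `Item` makes the per-category, per-tier accuracy breakdowns reported in the Results straightforward to compute by grouping graded responses on `category` and `tier`.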
