ArXiv TLDR

Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks

arXiv: 2604.19697

Jing Jin, Hao Liu, Yan Bai, Yihang Lou, Zhenke Wang + 6 more

cs.CV

TLDR

Introduces StepSTEM, a benchmark and step-level evaluation framework for fine-grained cross-modal STEM reasoning in MLLMs, and shows that current models still struggle with it.

Key contributions

  • StepSTEM: A new graduate-level benchmark (283 problems) for fine-grained cross-modal STEM reasoning in MLLMs.
  • Enforces strict complementarity between text and visual inputs to prevent unimodal shortcuts in MLLM evaluation.
  • Proposes a step-level evaluation framework that uses dynamic programming to align predicted reasoning steps with reference solutions (a minimal sketch follows this list).
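
To make the dynamic-programming step concrete, here is a minimal sketch of step-level alignment under stated assumptions: the step_similarity function, the zero-cost gaps, and the length normalization below are illustrative placeholders, not the paper's actual scoring.

    # Sketch: align predicted reasoning steps to a reference solution with
    # dynamic programming (Needleman-Wunsch style), then keep the
    # best-scoring reference. Similarity and normalization are assumptions.
    from difflib import SequenceMatcher

    def step_similarity(pred: str, ref: str) -> float:
        # Placeholder lexical similarity; the paper's framework presumably
        # uses a stronger judge (e.g. an LLM) for step matching.
        return SequenceMatcher(None, pred, ref).ratio()

    def align_score(pred_steps: list[str], ref_steps: list[str]) -> float:
        # dp[i][j] = best total similarity aligning the first i predicted
        # steps with the first j reference steps; skipped steps add 0.
        n, m = len(pred_steps), len(ref_steps)
        dp = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                match = dp[i - 1][j - 1] + step_similarity(
                    pred_steps[i - 1], ref_steps[j - 1])
                dp[i][j] = max(match, dp[i - 1][j], dp[i][j - 1])
        return dp[n][m]

    def step_level_score(pred_steps: list[str],
                         references: list[list[str]]) -> float:
        # Best alignment over multiple reference solutions, normalized by
        # reference length so the score falls in [0, 1].
        return max(align_score(pred_steps, ref) / max(len(ref), 1)
                   for ref in references)

Taking the maximum over references mirrors the abstract's alignment against multiple reference solutions; a real implementation would also return the alignment path for per-step feedback.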

Why it matters

This paper addresses a critical gap in MLLM evaluation: a benchmark that blocks unimodal shortcuts and scores the reasoning process itself rather than only final answers. It shows that even frontier MLLMs fall short on genuine cross-modal STEM reasoning, leaving substantial headroom for future work.

Original Abstract

Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly verifiable feedback, but existing benchmarks often permit unimodal shortcuts due to modality redundancy and focus mainly on final-answer accuracy, overlooking the reasoning process itself. To address this challenge, we introduce StepSTEM: a graduate-level benchmark of 283 problems across mathematics, physics, chemistry, biology, and engineering for fine-grained evaluation of cross-modal reasoning in MLLMs. StepSTEM is constructed through a rigorous curation pipeline that enforces strict complementarity between textual and visual inputs. We further propose a general step-level evaluation framework for both text-only chain-of-thought and interleaved image-text reasoning, using dynamic programming to align predicted reasoning steps with multiple reference solutions. Experiments across a wide range of models show that current MLLMs still rely heavily on textual reasoning, with even Gemini 3.1 Pro and Claude Opus 4.6 achieving only 38.29% accuracy. These results highlight substantial headroom for genuine cross-modal STEM reasoning and position StepSTEM as a benchmark for fine-grained evaluation of multimodal reasoning. Source code is available at https://github.com/lll-hhh/STEPSTEM.
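
The "strict complementarity" constraint in the curation pipeline is worth unpacking: a problem should survive curation only if neither modality alone suffices. Below is a hedged sketch of how such a filter could be automated with unimodal probes; the Problem type, the probe callable, and both thresholds are hypothetical, and the paper's actual pipeline may differ.

    # Hypothetical complementarity filter: keep a problem only if unimodal
    # probes fail while the multimodal probe succeeds. All names and
    # thresholds here are illustrative assumptions.
    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class Problem:
        question: str
        image: object  # e.g. a file path or a PIL.Image

    def is_complementary(p: Problem,
                         probe: Callable[[Optional[str], Optional[object]], float],
                         unimodal_max: float = 0.1,
                         multimodal_min: float = 0.5) -> bool:
        text_only = probe(p.question, None)    # text-only shortcut?
        image_only = probe(None, p.image)      # image-only shortcut?
        combined = probe(p.question, p.image)  # both modalities together
        return (text_only <= unimodal_max and
                image_only <= unimodal_max and
                combined >= multimodal_min)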
