ArXiv TLDR

PAPERMIND: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs

2604.21304

Yanjun Zhao, Tianxin Wei, Jiaru Zou, Xuying Ning, Yuanchen Bei + 5 more

cs.IR

TLDR

PAPERMIND is a new benchmark evaluating multimodal LLMs' integrated reasoning and critique over scientific papers across diverse domains.

Key contributions

  • Introduces PAPERMIND, a benchmark for integrated, agent-oriented scientific reasoning over real papers.
  • Covers seven scientific domains: agriculture, biology, chemistry, computer science, medicine, physics, and economics.
  • Features four task families: multimodal grounding, experimental interpretation, cross-source reasoning, and critical assessment.
  • Reveals consistent performance gaps across tasks in both open-source and closed-source multimodal LLMs, highlighting persistent challenges in integrated scientific reasoning and critique.

Why it matters

This paper addresses a crucial gap in evaluating LLMs' ability to perform complex, integrated scientific reasoning, essential for real-world applications. By providing a comprehensive benchmark, it helps diagnose current model limitations and guides future research towards more human-like scientific understanding and critique.

Original Abstract

Understanding scientific papers requires more than answering isolated questions or summarizing content. It involves an integrated reasoning process that grounds textual and visual information, interprets experimental evidence, synthesizes information across sources, and critically evaluates scientific claims. However, existing benchmarks typically assess these abilities in isolation, making it difficult to evaluate scientific paper understanding as a unified set of interacting cognitive abilities. In this work, we introduce PAPERMIND, a benchmark designed to evaluate integrated and agent-oriented scientific reasoning over research papers. PAPERMIND is constructed from real scientific papers across seven domains, including agriculture, biology, chemistry, computer science, medicine, physics, and economics. It comprises four complementary task families that collectively operationalize distinct cognitive facets of scientific paper reasoning, including multimodal grounding, experimental interpretation, cross-source evidence reasoning, and critical assessment. By analyzing model behavior across multiple tasks, PAPERMIND enables a diagnostic evaluation of integrated scientific reasoning behaviors that are difficult to assess through isolated task evaluations. Extensive experiments on both open-source and closed-source multimodal LLMs reveal consistent performance gaps across tasks, highlighting persistent challenges in integrated scientific reasoning and critique. Our benchmark and dataset are available at https://github.com/Yanjun-Zhao/PaperMind.
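
The abstract's emphasis on diagnostic, per-task-family evaluation can be pictured with a small scoring sketch. Everything below is hypothetical: the item schema, the field names, and the `ask_model` stub are illustrative assumptions and do not reflect the schema or code in the PaperMind repository.

```python
from collections import defaultdict

# Hypothetical benchmark items: each example is tagged with one of the four
# task families so accuracy can be broken down per cognitive facet.
# Field names and answers here are placeholders for illustration only.
items = [
    {"family": "multimodal_grounding",       "question": "...", "answer": "A"},
    {"family": "experimental_interpretation", "question": "...", "answer": "B"},
    {"family": "cross_source_reasoning",      "question": "...", "answer": "C"},
    {"family": "critical_assessment",         "question": "...", "answer": "D"},
]

def ask_model(question: str) -> str:
    """Placeholder for a call to a multimodal LLM; returns its answer string."""
    return "A"  # stub response so the sketch runs end to end

def per_family_accuracy(items):
    """Score predictions separately for each task family."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        prediction = ask_model(item["question"])
        total[item["family"]] += 1
        if prediction.strip() == item["answer"]:
            correct[item["family"]] += 1
    return {fam: correct[fam] / total[fam] for fam in total}

if __name__ == "__main__":
    for family, acc in per_family_accuracy(items).items():
        print(f"{family:28s} accuracy = {acc:.2f}")
```

In practice each family's score would come from the benchmark's own items and grading protocol; the sketch only shows how a per-family breakdown isolates which cognitive facet a model struggles with, which is the kind of diagnosis the paper reports across open-source and closed-source models.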
