ArXiv TLDR

MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

arXiv: 2604.18584

Shaden Alshammari, Kevin Wen, Abrar Zainal, Mark Hamilton, Navid Safaei + 3 more

cs.AI · cs.DL · cs.IR · cs.LG

TLDR

MathNet introduces a large-scale, multimodal, multilingual benchmark for evaluating mathematical reasoning and retrieval in AI models.

Key contributions

  • Presents MathNet, a dataset of 30,676 Olympiad-level math problems with solutions across 47 countries and 17 languages.
  • Constructs a benchmark covering three tasks — problem solving, math-aware retrieval, and retrieval-augmented problem solving — built on expert-curated pairs of mathematically equivalent and structurally similar problems.
  • Shows that even state-of-the-art models struggle (78.4% for Gemini-3.1-Pro, 69.3% for GPT-5), while retrieval-augmented generation yields gains of up to 12%.

Why it matters

MathNet addresses the size, language-coverage, and task-diversity limits of existing math benchmarks by providing a far larger and more diverse resource. It exposes current models' weaknesses in complex mathematical reasoning and retrieval, while demonstrating that retrieval-augmented approaches can recover part of that gap.
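The retrieval-augmented problem-solving pipeline the paper evaluates can be illustrated with a minimal sketch: embed problems as vectors, retrieve the most similar one from a corpus, and prepend it to the solver prompt. The toy example below uses a bag-of-words embedding with cosine similarity; the paper's actual embedding models and prompt formats are not specified here, and all function names and sample problems are illustrative.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding: lowercase whitespace-token counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=1):
    """Return the k corpus problems most similar to the query."""
    q = embed(query)
    return sorted(corpus, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]

def augment_prompt(query, corpus):
    """Build a retrieval-augmented prompt for a solver model."""
    context = "\n".join(f"Related problem: {p}" for p in retrieve(query, corpus))
    return f"{context}\n\nSolve: {query}"

corpus = [
    "Prove that the sum of two odd integers is even.",
    "Find all primes p such that p + 2 is also prime.",
]
print(augment_prompt("Show that the sum of an odd and an even integer is odd.", corpus))
```

The paper's finding that RAG performance is highly sensitive to retrieval quality corresponds to the `retrieve` step here: if it surfaces a structurally unrelated problem, the added context can mislead rather than help the solver.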

Original Abstract

Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at https://mathnet.mit.edu.
