MathDuels: Evaluating LLMs as Problem Posers and Solvers
Zhiqiu Xu, Shibo Jin, Shreya Arya, Mayur Naik
TLDR
MathDuels is a self-play benchmark in which LLMs both author and solve math problems, revealing partially decoupled authoring and solving abilities and yielding an evaluation whose difficulty co-evolves with model strength.
Key contributions
- Introduces MathDuels, a self-play benchmark where LLMs act as both problem posers and solvers.
- Uses a three-stage pipeline (meta-prompting, problem generation, and difficulty amplification) with adversarial prompting to produce difficulty-amplified math problems; a sketch follows this list.
- Employs a Rasch model to jointly estimate solver abilities and problem difficulties, with author quality derived from the difficulties of each model's authored problems (see the fitting sketch after this list).
- Reveals that authoring and solving capabilities in LLMs are partially decoupled.
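A minimal sketch of how the three-stage authoring pipeline could be wired together, assuming a generic chat-completion client. The `call_llm` wrapper, the prompt wording, and the YES/NO verifier protocol are illustrative stand-ins, not the paper's actual implementation:

```python
def call_llm(model: str, prompt: str) -> str:
    """Hypothetical wrapper around a chat-completion API; plug in your client."""
    raise NotImplementedError

def author_problem(author_model: str, verifier_model: str) -> str | None:
    # Stage 1: meta-prompting -- the author drafts its own problem-writing brief.
    meta = call_llm(author_model,
                    "Write instructions for composing a hard, well-posed "
                    "competition math problem with a unique answer.")
    # Stage 2: problem generation from the self-authored brief.
    problem = call_llm(author_model, meta)
    # Stage 3: difficulty amplification under adversarial prompting.
    hard = call_llm(author_model,
                    "Rewrite this problem to be substantially harder for a "
                    "frontier LLM while keeping it well-posed:\n" + problem)
    # An independent verifier excludes ill-posed questions.
    verdict = call_llm(verifier_model,
                       "Is this problem well-posed with a unique answer? "
                       "Reply YES or NO.\n" + hard)
    return hard if verdict.strip().upper().startswith("YES") else None
```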
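And a minimal sketch of the joint Rasch estimation, assuming the standard one-parameter logistic form P(solved) = sigmoid(theta_s - b_i) fit by gradient ascent on the log-likelihood. The paper does not specify its estimator, so the optimizer, learning rate, and centering convention here are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_rasch(outcomes, n_iters=2000, lr=0.05):
    """Jointly estimate solver abilities theta and problem difficulties b
    under the Rasch model P(solved) = sigmoid(theta_s - b_i).

    outcomes: (n_solvers, n_problems) array; 1 = solved, 0 = failed,
              np.nan = not attempted.
    """
    theta = np.zeros(outcomes.shape[0])  # solver abilities
    b = np.zeros(outcomes.shape[1])      # problem difficulties
    seen = ~np.isnan(outcomes)           # observed (solver, problem) pairs
    x = np.nan_to_num(outcomes)

    for _ in range(n_iters):
        p = sigmoid(theta[:, None] - b[None, :])  # predicted solve probs
        resid = (x - p) * seen                    # per-cell score terms
        theta += lr * resid.sum(axis=1)           # d logL / d theta_s
        b -= lr * resid.sum(axis=0)               # d logL / d b_i
        # Rasch is shift-invariant, so pin the scale by centering theta
        # (shifting b by the same amount leaves theta_s - b_i unchanged).
        shift = theta.mean()
        theta -= shift
        b -= shift
    return theta, b

# Toy arena: rows are solvers, columns are authored problems.
outcomes = np.array([[1.0, 1.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0, 0.0],
                     [1.0, 1.0, 1.0, 0.0]])
theta, b = fit_rasch(outcomes)
# Author quality for a model m is then derived from b over the problems
# m authored, e.g. b[author_id == m].mean() (aggregation is an assumption).
```

On this toy matrix, the problem every solver cracks receives a low difficulty and the unsolved one a high difficulty, which is exactly the signal the benchmark aggregates into author-quality scores.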
Why it matters
Existing math benchmarks saturate quickly and fail to differentiate advanced models. MathDuels addresses this by co-evolving problem difficulty with model strength, so the evaluation never settles at a fixed ceiling. By scoring models in dual roles, it also surfaces capability separations that single-role benchmarks miss.
Original Abstract
As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), and validated by an independent verifier that excludes ill-posed questions. A Rasch model (Rasch, 1993) jointly estimates solver abilities and problem difficulties; author quality is derived from the difficulties of each model's authored problems. Experiments across 19 frontier models reveal that authoring and solving capabilities are partially decoupled, and that dual-role evaluation reveals capability separations invisible in single-role benchmarks. As newer models enter the arena, they produce problems that defeat previously dominant solvers, so the benchmark's difficulty co-evolves with participant strength rather than saturating at a fixed ceiling. We host a public leaderboard that updates as new models are released.