Verifier-Backed Hard Problem Generation for Mathematical Reasoning
Yuhang Lai, Jiazhan Feng, Yee Whye Teh, Ning Miao
TLDR
VHG is a verifier-enhanced framework, built on three-party self-play, for generating valid and challenging mathematical problems for LLMs. The authors report that it substantially outperforms all baseline methods.
Key contributions
- Introduces VHG, a three-party self-play framework for generating hard mathematical problems.
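- Integrates an independent verifier into the setter-solver setup, so that the setter's reward depends jointly on problem validity (judged by the verifier) and difficulty (judged by the solver).
- Evaluates VHG with a Hard symbolic verifier and a Soft LLM-based verifier on indefinite integral tasks and general mathematical reasoning tasks.
- Reports that VHG substantially outperforms all baseline methods at generating valid and challenging problems.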
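To make the idea of a validity check on an indefinite integral task concrete: a proposed antiderivative F is valid for an integrand f when F'(x) = f(x). The sketch below tests that condition numerically with a central finite difference at random sample points. It is a stand-in for illustration only, not the paper's Hard symbolic verifier; the function name `check_antiderivative`, the sampling interval, and the tolerances are all assumptions.

```python
import math
import random


def check_antiderivative(integrand, candidate, lo=0.5, hi=2.0,
                         samples=20, h=1e-5, tol=1e-4, seed=0):
    """Return True if candidate'(x) matches integrand(x) at sampled points.

    Hypothetical numeric check (an assumption, not the paper's verifier):
    a symbolic verifier would compare expressions exactly instead.
    """
    rng = random.Random(seed)
    for _ in range(samples):
        x = rng.uniform(lo, hi)
        # Central finite difference approximates the candidate's derivative.
        derivative = (candidate(x + h) - candidate(x - h)) / (2 * h)
        expected = integrand(x)
        if abs(derivative - expected) > tol * max(1.0, abs(expected)):
            return False
    return True


# Example problem: integrate x * cos(x).
integrand = lambda x: x * math.cos(x)
right_answer = lambda x: x * math.sin(x) + math.cos(x)
wrong_answer = lambda x: x * math.sin(x)

print(check_antiderivative(integrand, right_answer))
print(check_antiderivative(integrand, wrong_answer))
```

A numeric check like this can only reject answers at the points it samples, which is one reason an exact symbolic comparison is the stronger choice when it is available.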
Why it matters
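LLMs solve scientific and mathematical problems well but struggle to produce valid, challenging, and novel ones, which are needed both for LLM training and for autonomous scientific research. Existing approaches either rely on expensive human experts or use naive self-play, which often yields invalid problems through reward hacking. VHG addresses this by tying the setter's reward to both verifier-checked validity and solver-assessed difficulty.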
Original Abstract
Large Language Models (LLMs) demonstrate strong capabilities for solving scientific and mathematical problems, yet they struggle to produce valid, challenging, and novel problems - an essential component for advancing LLM training and enabling autonomous scientific research. Existing problem generation approaches either depend on expensive human expert involvement or adopt naive self-play paradigms, which frequently yield invalid problems due to reward hacking. This work introduces VHG, a verifier-enhanced hard problem generation framework built upon three-party self-play. By integrating an independent verifier into the conventional setter-solver duality, our design constrains the setter's reward to be jointly determined by problem validity (evaluated by the verifier) and difficulty (assessed by the solver). We instantiate two verifier variants: a Hard symbolic verifier and a Soft LLM-based verifier, with evaluations conducted on indefinite integral tasks and general mathematical reasoning tasks. Experimental results show that VHG substantially outperforms all baseline methods by a clear margin.
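The abstract says the setter's reward is jointly determined by validity and difficulty but does not give a formula. The following is a minimal sketch of one way such a reward could be combined, assuming validity acts as a hard gate and difficulty is measured as the fraction of solver attempts that fail. The name `setter_reward` and this particular combination are assumptions for illustration, not the paper's definition.

```python
from typing import Sequence


def setter_reward(is_valid: bool, solver_correct: Sequence[bool]) -> float:
    """Hypothetical setter reward: zero if invalid, else the solver failure rate.

    is_valid       -- verdict from the verifier (assumed binary here)
    solver_correct -- one entry per solver attempt, True if it was solved
    """
    if not solver_correct:
        raise ValueError("need at least one solver attempt")
    if not is_valid:
        # An invalid problem earns nothing, however hard it looks;
        # this gate is what blocks reward hacking via unsolvable problems.
        return 0.0
    solve_rate = sum(solver_correct) / len(solver_correct)
    return 1.0 - solve_rate


# A valid problem solved in 1 of 4 attempts scores higher than
# an invalid problem that no attempt could solve.
print(setter_reward(True, [True, False, False, False]))
print(setter_reward(False, [False, False, False, False]))
```

Without the validity gate, a setter rewarded only for solver failure could maximize its reward by posing ill-formed problems, which is the reward-hacking failure the abstract attributes to naive self-play.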