ArXiv TLDR

Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

arXiv:2604.18493

Zhenwen Liang, Yujun Zhou, Sidi Lu, Xiangliang Zhang, Haitao Mi + 1 more

cs.LG

TLDR

This paper introduces CUTS and Mixed-CUTS to prevent mode collapse in RL for LLMs on saturated reasoning data, boosting generalization.

Key contributions

  • Addresses RL mode collapse in LLMs on saturated reasoning data, where solutions are too homogeneous.
  • Introduces Constrained Uniform Top-K Sampling (CUTS) for structure-preserving exploration.
  • Proposes Mixed-CUTS, a training framework combining exploitative and exploratory rollouts.
  • Improves Pass@1 accuracy on AIME25 by up to 15.1% over GRPO, boosting out-of-domain generalization.
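The core decoding idea above (sampling uniformly from constrained high-confidence candidates rather than proportionally to model probabilities) can be sketched briefly. The paper's exact candidate constraint is not spelled out in this summary, so the probability floor `p_min` below is an illustrative assumption, not the authors' specification:

```python
import numpy as np

def cuts_sample(probs, k=5, p_min=0.1, rng=None):
    """Illustrative sketch of Constrained Uniform Top-K Sampling (CUTS).

    Keep the top-k tokens, drop any below a probability floor (the
    confidence constraint assumed here), then sample *uniformly* over
    the survivors instead of proportionally to their probabilities.
    This flattens the local landscape among high-confidence candidates.
    """
    rng = rng or np.random.default_rng()
    top = np.argsort(probs)[::-1][:k]       # top-k candidate token ids
    allowed = top[probs[top] >= p_min]      # assumed confidence constraint
    if allowed.size == 0:                   # degenerate case: keep argmax
        allowed = top[:1]
    return int(rng.choice(allowed))         # uniform, not probability-weighted

# Toy next-token distribution over token ids 0..4
probs = np.array([0.55, 0.25, 0.12, 0.05, 0.03])
token = cuts_sample(probs, k=3, p_min=0.1)
```

Under standard sampling, token 0 would dominate; under this sketch, ids 0, 1, and 2 are drawn with equal probability, which is the structure-preserving exploration the paper is after.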

Why it matters

This paper tackles a key challenge in scaling RL for LLMs: preventing mode collapse on saturated reasoning data, where a strong base model already solves nearly every training problem and the group-relative learning signal dries up. By enforcing diverse exploration, it lets models keep learning robustly even on benchmarks they have largely saturated, which is crucial for out-of-domain generalization on complex reasoning tasks.

Original Abstract

Reinforcement Learning (RL) enhances LLM reasoning, yet a paradox emerges as models scale: strong base models saturate standard benchmarks (e.g., MATH), yielding correct but homogeneous solutions. In such environments, the lack of failure cases causes the advantage signal in group-relative algorithms (e.g., GRPO) to vanish, driving policies into mode collapse. To address this, we propose Constrained Uniform Top-K Sampling (CUTS), a parameter-free decoding strategy enforcing structure-preserving exploration. Unlike standard sampling that follows model biases, CUTS flattens the local optimization landscape by sampling uniformly from constrained high-confidence candidates. We integrate this into Mixed-CUTS, a training framework synergizing exploitative and exploratory rollouts to amplify intra-group advantage variance. Experiments on Qwen3 models demonstrate that our approach prevents policy degeneration and significantly boosts out-of-domain generalization. Notably, Mixed-CUTS improves Pass@1 accuracy on the challenging AIME25 benchmark by up to 15.1% over standard GRPO, validating that maintaining diversity within the semantic manifold is critical for rigorous reasoning.
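The vanishing-advantage failure the abstract describes can be seen directly in GRPO's group-relative advantage, which standardizes each rollout's reward within its group: when every rollout in a group earns the same (correct) reward, all advantages are zero and the policy gradient carries no signal. A toy illustration (the small `eps` in the denominator is a common implementation choice for the all-equal case, not necessarily the paper's):

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages in the GRPO style:
    standardize each rollout's reward within its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

mixed = grpo_advantages([1.0, 0.0, 1.0, 1.0])      # some failures: nonzero signal
saturated = grpo_advantages([1.0, 1.0, 1.0, 1.0])  # all correct: advantages all zero
```

On a saturated benchmark the `saturated` case dominates, which is why the abstract argues for amplifying intra-group advantage variance by mixing exploitative rollouts with CUTS-based exploratory ones.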
