
Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

arXiv:2605.00365

Anamika Lochab, Bolian Li, Ruqi Zhang

cs.LG · cs.CL · stat.ML

TLDR

UCPO improves diversity in RLVR by penalizing non-uniform distributions over correct solutions, boosting Pass@K while maintaining Pass@1.

Key contributions

  • Identifies that common RLVR objectives, such as GRPO, are indifferent to how probability mass is distributed among correct solutions, causing diversity collapse.
  • Formalizes the collapse mechanism and characterizes the Uniform-Correct Policy as uniquely optimal under two complementary criteria: robustness and entropy-regularized optimality.
  • Proposes Uniform-Correct Policy Optimization (UCPO), which adds a conditional uniformity penalty to GRPO (a minimal sketch follows this list).
  • UCPO significantly improves Pass@K and diversity on math reasoning benchmarks: up to +10% absolute Pass@64 on AIME24 and up to 45% higher equation-level diversity within the correct set.
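
The summary does not spell out the penalty's exact form, so the following is a minimal sketch of the idea, assuming the conditional uniformity penalty is a KL divergence between the policy's renormalized distribution over the correct samples in a group and the uniform distribution on that set; `ucpo_loss` and `lam` are hypothetical names, and GRPO's PPO-style ratio/clipping machinery is omitted for brevity.

```python
import torch

def ucpo_loss(logprobs: torch.Tensor, rewards: torch.Tensor,
              lam: float = 0.1) -> torch.Tensor:
    """Group-relative policy loss with an assumed conditional uniformity penalty.

    logprobs: (G,) summed token log-probs of each sampled response under the
              current policy (assumes a group of G > 1 samples per prompt).
    rewards:  (G,) verifiable rewards, 1.0 if the response is correct, else 0.0.
    lam:      penalty weight (hypothetical hyperparameter).
    """
    # GRPO-style advantage: reward standardized within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    pg_loss = -(adv * logprobs).mean()  # REINFORCE form; GRPO clipping omitted

    # Conditional uniformity penalty (assumed form): KL(p || uniform) over the
    # correct subset, where p renormalizes the policy's mass across correct
    # samples. KL(p || uniform) = log k - H(p) is zero iff p is uniform, so
    # minimizing it shifts gradient toward underrepresented correct responses.
    correct = rewards > 0.5
    penalty = logprobs.new_zeros(())
    if correct.sum() > 1:
        p = torch.softmax(logprobs[correct], dim=0)
        k = correct.sum().float()
        penalty = torch.log(k) + (p * torch.log(p + 1e-12)).sum()

    return pg_loss + lam * penalty
```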

Why it matters

This paper tackles diversity collapse in RLVR, a critical limitation for reasoning tasks. UCPO enables RLVR to maintain high single-attempt accuracy while substantially broadening the set of correct solutions the model produces. This makes RLVR more robust and practical for applications that need diverse, verifiable outputs.

Original Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved substantial gains in single-attempt accuracy (Pass@1) on reasoning tasks, yet often suffers from reduced multi-sample coverage (Pass@K), indicating diversity collapse. We identify a structural cause for this degradation: common RLVR objectives, such as GRPO, are indifferent to how probability mass is distributed among correct solutions. Combined with stochastic training dynamics, this indifference induces a self-reinforcing collapse, in which probability mass concentrates on a narrow subset of correct outputs while alternative valid solutions are suppressed. We formalize this collapse mechanism and further characterize the optimal policy structure under two complementary criteria: robustness and entropy-regularized optimality, which identify the Uniform-Correct Policy as uniquely optimal. Motivated by this analysis, we propose Uniform-Correct Policy Optimization (UCPO), a modification to GRPO that adds a conditional uniformity penalty on the policy's distribution over correct solutions. The penalty redistributes gradient signal toward underrepresented correct responses, encouraging uniform allocation of probability mass within the correct set. Across three models (1.5B-7B parameters) and five mathematical reasoning benchmarks, UCPO improves Pass@K and diversity while maintaining competitive Pass@1, achieving up to +10% absolute improvement on AIME24 at Pass@64 and up to 45% higher equation-level diversity within the correct set. The code is available at https://github.com/AnamikaLochab/UCPO.
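
For context on the headline metric: Pass@K is conventionally computed with the standard unbiased estimator, which draws n samples per problem, counts the c correct ones, and estimates the probability that a random size-k subset contains at least one correct solution. A short self-contained sketch (the function name is ours):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased Pass@K estimator: probability that a random
    subset of k out of n generated samples (c of them correct)
    contains at least one correct solution."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a size-k subset
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 64 samples with 12 correct.
print(pass_at_k(64, 12, 1))   # 0.1875 -> Pass@1
print(pass_at_k(64, 12, 64))  # 1.0    -> Pass@64
```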
