Rollout Pass-Rate Control: Steering Binary-Reward RL Toward Its Most Informative Regime
Tianshu Zhu, Wenyu Zhang, Xiaoying Zuo, Lun Tian, Haotian Zhao, et al.
TLDR
Prefix Sampling steers binary-reward RL toward an optimal 50% pass rate, maximizing informativeness and accelerating the training of SWE-bench-style agents.
Key contributions
- Shows that a 50% pass rate is the most informative operating point in binary-reward RL, jointly maximizing reward entropy and success–failure contrast.
- Introduces Prefix Sampling (PS) to dynamically steer rollout groups toward this optimal 50% pass rate.
- PS reuses trajectory prefixes to either boost failing groups or handicap passing groups.
- Achieves end-to-end wall-clock speedups of 2.01x on Qwen3-14B and 1.55x on Qwen3-32B for SWE-bench-style agentic RL, while preserving or improving final verified performance.
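The informativeness claim behind the first contribution can be checked directly: for a group of binary rewards with pass rate p, both the Shannon entropy of the reward and the mean-centered "advantage energy" used in GRPO-style updates peak at p = 0.5. A minimal illustrative sketch (function names are ours, not the paper's code):

```python
import math

def reward_entropy(p):
    """Shannon entropy (bits) of a Bernoulli(p) reward: maximized at p = 0.5."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def advantage_energy(p, n=8):
    """Sum of squared mean-centered advantages for a group of n binary
    rewards with pass rate p. Equals n * p * (1 - p), also peaked at 0.5."""
    k = round(p * n)
    rewards = [1.0] * k + [0.0] * (n - k)
    mean = sum(rewards) / n
    return sum((r - mean) ** 2 for r in rewards)

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"p={p:.2f}  entropy={reward_entropy(p):.3f}  energy={advantage_energy(p):.3f}")
```

Skewed groups (p near 0 or 1) contribute almost no contrastive signal, which is the compute waste the paper targets.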
Why it matters
This paper addresses a key inefficiency in binary-reward RL, where skewed pass rates waste compute. By identifying and steering toward an optimal 50% pass rate, it offers a principled way to make expensive agentic RL tasks significantly faster and more effective, which matters for real-world deployment of such agents.
Original Abstract
SWE-bench-style agentic reinforcement learning relies on expensive stateful trajectories, yet substantial compute is wasted on sampled rollout groups with skewed pass rates, where binary rewards provide a weak contrastive signal. We frame this inefficiency as a pass-rate control problem and show that a 50% pass rate is the most informative operating point: it maximizes reward entropy, the probability of surviving group filtering, RLOO advantage energy under GRPO, and success–failure contrastive structure. Guided by this principle, we propose Prefix Sampling (PS), which replays trajectory prefixes to steer skewed groups toward this regime: successful prefixes serve as head starts for mostly failing groups, while failing prefixes serve as handicaps for mostly passing groups. In stateful agent environments, prefix states are reconstructed through replay while replayed tokens are excluded from the loss, restricting optimization to continuations generated by the current policy. On SWE-bench-style agentic RL, PS delivers end-to-end wall-clock speedups of 2.01x on Qwen3-14B and 1.55x on Qwen3-32B while preserving or improving final verified performance. For 14B, the SWE-bench Verified peak rises from the baseline peak of 0.273 to 0.295 under PS. Additional mathematical reasoning experiments on AIME 2025 show the same pass-rate control pattern and decompose the gains into replay, bidirectional coverage, and adaptive control.
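The steering rule the abstract describes can be sketched as a small control loop: measure a group's pass rate, then replay a prefix from the minority outcome so the regenerated group lands nearer 50%. This is an illustrative sketch under our own assumptions (the names `prefix_sample`, `sample_fn`, and the halfway prefix cut are hypothetical, not the paper's API); the paper's stateful environments additionally reconstruct prefix states via replay:

```python
import random

def prefix_sample(group, sample_fn, target=0.5):
    """Sketch of Prefix Sampling's steering rule.

    `group` is a list of (trajectory, passed) pairs; `sample_fn(prefix)`
    returns a fresh continuation rolled out from a replayed prefix.
    """
    pass_rate = sum(passed for _, passed in group) / len(group)
    if pass_rate == target:
        return group  # already at the most informative operating point
    if pass_rate < target:
        # Mostly failing: replay a successful prefix as a head start.
        donors = [traj for traj, passed in group if passed]
    else:
        # Mostly passing: replay a failing prefix as a handicap.
        donors = [traj for traj, passed in group if not passed]
    if not donors:
        return group  # fully skewed group with no in-group donor prefix
    prefix = random.choice(donors)
    prefix = prefix[: len(prefix) // 2]  # illustrative cut point
    # Replayed prefix tokens would be masked out of the loss, so only the
    # continuation generated by the current policy is optimized.
    return [(prefix + sample_fn(prefix), None) for _ in range(len(group))]
```

In the real training loop the regenerated continuations would be re-scored by the verifier (the `None` pass flags here stand in for that step), and loss masking over the replayed prefix keeps the update strictly on-policy for the continuation.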