ArXiv TLDR

Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

arXiv: 2604.02288

Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng + 4 more

cs.LG, cs.AI

TLDR

SRPO unifies GRPO and SDPO by routing correct samples to GRPO and failed samples to SDPO, pairing stable training with targeted correction and delivering superior performance and efficiency in RLVR.

Key contributions

  • Proposes SRPO, a unified on-policy framework that combines GRPO's long-horizon stability with SDPO's targeted, logit-level correction.
  • Routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's denser distillation supervision (see the sketch after this list).
  • Incorporates entropy-aware dynamic weighting that suppresses high-entropy, unreliable distillation targets while emphasizing confident ones.
  • Achieves rapid early improvement with long-term stability, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO.
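
The sketch below shows one way the routing could look in PyTorch. It is an illustrative reconstruction from the abstract, not the authors' code: the exact loss forms, how the self-teacher is obtained, and the exp(-entropy) weighting shape are all assumptions.

```python
# Illustrative sketch only -- not the authors' implementation. Assumed setup:
# binary verifier rewards per rollout, a frozen "self-teacher" (e.g., an
# earlier policy snapshot), and a simple exp(-entropy) confidence weight.
import torch
import torch.nn.functional as F

def srpo_loss(rewards, policy_logprobs, policy_logits, teacher_logits, mask):
    """
    rewards:         (G,)      0/1 verifiable reward per rollout in the group
    policy_logprobs: (G, T)    log pi_theta of the sampled tokens
    policy_logits:   (G, T, V) current-policy logits
    teacher_logits:  (G, T, V) frozen self-teacher logits
    mask:            (G, T)    1 for real tokens, 0 for padding
    """
    correct = rewards > 0.5                      # routed to GRPO
    failed = ~correct                            # routed to SDPO

    # GRPO branch: group-relative advantage, applied only to correct rollouts.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    pg = -(adv[:, None] * policy_logprobs) * mask
    grpo = pg[correct].sum() / mask[correct].sum().clamp(min=1)

    # SDPO branch: token-level KL to the self-teacher on failed rollouts.
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    p_logp = F.log_softmax(policy_logits, dim=-1)
    kl = F.kl_div(p_logp, t_logp, log_target=True, reduction="none").sum(-1)

    # Entropy-aware dynamic weighting (assumed form): down-weight tokens
    # where the self-teacher's distribution is high-entropy (unreliable).
    teacher_entropy = -(t_logp.exp() * t_logp).sum(-1)
    weight = torch.exp(-teacher_entropy)
    sdpo = (weight * kl * mask)[failed].sum() / mask[failed].sum().clamp(min=1)

    return grpo + sdpo
```

One structural note on the sketch: both branches operate on the same on-policy rollout group, so the routing itself adds no generation cost; the extra teacher forward pass on failed samples is the SDPO branch's main overhead.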

Why it matters

Reinforcement learning with verifiable rewards (RLVR) is a standard paradigm for LLM post-training, but its two main approaches each fall short: GRPO's coarse, rollout-level credit assignment is inefficient, while SDPO's denser self-distillation signal tends to collapse over long training runs. By routing each sample to the objective best suited to it, SRPO combines their strengths, yielding more robust and performant LLMs at lower per-step compute cost.

Original Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.
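
To make the abstract's "coarse credit assignment" point concrete, here is a toy group of four rollouts with made-up rewards (illustrative only, not from the paper):

```python
# Toy example: GRPO assigns one group-normalized advantage per rollout,
# so every token in a failed rollout is penalized equally, wherever the
# actual mistake occurred.
rewards = [1.0, 0.0, 0.0, 1.0]                 # verifier outcomes for 4 rollouts
mean = sum(rewards) / len(rewards)             # 0.5
std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5  # 0.5
advantages = [(r - mean) / std for r in rewards]
print(advantages)                              # [1.0, -1.0, -1.0, 1.0]
```

Both failed rollouts receive the identical scalar penalty of -1.0, spread uniformly across all of their tokens; SDPO's logit-level supervision, which SRPO reserves for exactly these failed samples, supplies the missing token-level focus.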
