ArXiv TLDR

Self-Distilled RLVR

arXiv: 2604.03128

Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu + 5 more

cs.LG, cs.CL

TLDR

RLSD combines RLVR with self-distillation: self-distillation supplies fine-grained, token-level update magnitudes while RLVR supplies reliable update directions, yielding more stable LLM training and a higher convergence ceiling.

Key contributions

  • Identifies information leakage and instability issues in on-policy self-distillation (OPSD) for LLMs.
  • Proposes RLSD, a novel method combining RLVR with self-distillation for robust training.
  • Leverages self-distillation for fine-grained, token-level policy update magnitudes.
  • Utilizes RLVR to derive reliable update directions from environmental feedback.

Why it matters

This paper addresses the information-leakage and instability issues that arise in on-policy self-distillation for LLMs. By combining the strengths of RLVR and self-distillation, RLSD offers a more stable, higher-performing training paradigm, improving the reliability and effectiveness of LLM fine-tuning.

Original Abstract

On-policy distillation (OPD) has become a popular training paradigm in the LLM community. This paradigm selects a larger model as the teacher to provide dense, fine-grained signals for each sampled trajectory, in contrast to reinforcement learning with verifiable rewards (RLVR), which only obtains sparse signals from verifiable outcomes in the environment. Recently, the community has explored on-policy self-distillation (OPSD), where the same model serves as both teacher and student, with the teacher receiving additional privileged information such as reference answers to enable self-evolution. This paper demonstrates that learning signals solely derived from the privileged teacher result in severe information leakage and unstable long-term training. Accordingly, we identify the optimal niche for self-distillation and propose RLSD (RLVR with Self-Distillation). Specifically, we leverage self-distillation to obtain token-level policy differences for determining fine-grained update magnitudes, while continuing to use RLVR to derive reliable update directions from environmental feedback (e.g., response correctness). This enables RLSD to simultaneously harness the strengths of both RLVR and OPSD, achieving a higher convergence ceiling and superior training stability.
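
To make the mechanism concrete, here is a minimal sketch of the update-signal combination the abstract describes: a per-token policy gap between a student pass and a privileged-teacher pass of the same model sets the update magnitude, and the verifiable reward sets the direction. The function name `rlsd_token_weights`, the choice of KL divergence as the token-level difference, and the ±1 reward encoding are illustrative assumptions, not details confirmed by the paper.

```python
# Hypothetical sketch of the RLSD signal: self-distillation supplies a
# per-token magnitude (student vs. privileged-teacher policy gap), while a
# verifiable reward supplies the direction. Names and shapes are assumptions.

import torch
import torch.nn.functional as F


def rlsd_token_weights(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor,
                       reward: float) -> torch.Tensor:
    """Combine token-level policy differences (magnitude) with a
    verifiable-reward sign (direction) into per-token update weights.

    student_logits, teacher_logits: [seq_len, vocab] logits from the same
    model, where the teacher pass additionally conditions on privileged
    information such as the reference answer.
    reward: +1.0 for a verified-correct response, -1.0 otherwise (assumed).
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    # Per-token KL(teacher || student): how far each token's policy sits
    # from the privileged-teacher policy -> fine-grained update magnitude.
    token_kl = (teacher_logp.exp() * (teacher_logp - student_logp)).sum(-1)
    # RLVR supplies the direction; self-distillation only scales it.
    return reward * token_kl


if __name__ == "__main__":
    torch.manual_seed(0)
    s = torch.randn(8, 32000)  # student logits for 8 sampled tokens
    t = torch.randn(8, 32000)  # privileged-teacher logits for the same tokens
    print(rlsd_token_weights(s, t, reward=1.0))
```

Under this reading, a verified-correct response pushes the policy hardest on exactly the tokens where the student and privileged-teacher passes disagree, while an incorrect response applies the opposite direction with the same token-level weighting.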
