ArXiv TLDR

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

arXiv: 2604.13016

Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao + 6 more

cs.LG · cs.AI · cs.CL

TLDR

This paper investigates the training dynamics of on-policy distillation (OPD) in LLMs, identifying the conditions under which it succeeds, its token-level mechanism, and practical strategies for recovering failing runs.

Key contributions

  • Identifies two key conditions for successful on-policy distillation: the student and teacher must share compatible thinking patterns, and the teacher must offer genuinely novel capabilities beyond what the student has already seen in training.
  • Reveals that successful OPD proceeds by progressive alignment on high-probability tokens at student-visited states.
  • Proposes practical strategies, off-policy cold start and teacher-aligned prompt selection, to recover failing OPD runs.
  • Highlights that OPD's dense token-level reward is not a free lunch and may limit its scalability to long-horizon distillation (a minimal sketch of this per-token training signal follows this list).
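
To ground these points, below is a minimal, hypothetical sketch of one OPD training step as the abstract describes it: the student generates a rollout, and the teacher supplies a dense per-token signal at every student-visited state. It assumes Hugging Face-style causal LMs (`student`, `teacher`) and uses per-token reverse KL as the divergence, a common choice for on-policy distillation; the paper's exact objective, models, and hyperparameters may differ.

```python
# Minimal sketch of one on-policy distillation (OPD) step.
# `student`, `teacher`, and `prompt_ids` are illustrative names, not the paper's code.
import torch
import torch.nn.functional as F

def opd_step(student, teacher, prompt_ids, max_new_tokens=256):
    # 1) Sample a rollout from the *student*, so supervision lands on
    #    student-visited states (the defining property of on-policy distillation).
    with torch.no_grad():
        rollout = student.generate(prompt_ids, max_new_tokens=max_new_tokens,
                                   do_sample=True)

    # 2) Score the same rollout with both models (position t predicts token t+1).
    student_logits = student(rollout).logits[:, :-1]
    with torch.no_grad():
        teacher_logits = teacher(rollout).logits[:, :-1]

    # 3) Dense token-level loss: reverse KL(student || teacher) at every
    #    generated position, acting as a per-token reward signal.
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    per_token_kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1)

    # Mask out prompt positions so only generated tokens contribute.
    gen_mask = torch.zeros_like(per_token_kl)
    gen_mask[:, prompt_ids.shape[1] - 1:] = 1.0
    loss = (per_token_kl * gen_mask).sum() / gen_mask.sum()
    loss.backward()
    return loss.item()
```

Because the loss is computed at every generated position, supervision is dense; per the abstract, that density is also what raises the question of whether OPD can scale to long-horizon distillation.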

Why it matters

On-policy distillation is a core post-training technique for LLMs, yet its training dynamics have been poorly understood. By identifying when OPD succeeds, explaining its token-level mechanism, and offering practical fixes for failing runs, the paper gives practitioners concrete guidance for training and scaling large language models with distillation.

Original Abstract

On-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood. This paper provides a systematic investigation of OPD dynamics and mechanisms. We first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training. We validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student's perspective. Probing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, a small shared token set that concentrates most of the probability mass (97%-99%). We further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection. Finally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.
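
As a concrete illustration of the token-level mechanism, the snippet below shows one plausible way to probe how much probability mass the student and teacher concentrate on a small shared set of high-probability tokens at student-visited states. The function name, the top-k construction of the shared set, and the value of `k` are assumptions for illustration; the paper's actual metric is not reproduced here.

```python
# Hypothetical probe of the token-level mechanism described in the abstract:
# at each student-visited state, how much probability mass do both models
# place on a small shared set of high-probability tokens?
import torch
import torch.nn.functional as F

@torch.no_grad()
def shared_top_mass(student_logits, teacher_logits, k=16):
    """Per-position probability mass that each model assigns to the
    intersection of their top-k token sets. Inputs: [T, vocab] logits
    scored on one student-generated sequence."""
    s_prob = F.softmax(student_logits, dim=-1)
    t_prob = F.softmax(teacher_logits, dim=-1)

    s_top = s_prob.topk(k, dim=-1).indices  # [T, k]
    t_top = t_prob.topk(k, dim=-1).indices  # [T, k]

    masses = []
    for pos in range(s_prob.shape[0]):
        shared = set(s_top[pos].tolist()) & set(t_top[pos].tolist())
        idx = torch.tensor(sorted(shared), dtype=torch.long)
        if len(idx) == 0:
            masses.append((0.0, 0.0))
            continue
        masses.append((s_prob[pos, idx].sum().item(),
                       t_prob[pos, idx].sum().item()))
    return masses  # list of (student_mass, teacher_mass) per position
```

In a successful run, the paper reports that such a small shared token set ends up concentrating most of the probability mass (97%-99%).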
