Unmasking On-Policy Distillation: Where It Helps, Where It Hurts, and Why
Mohammadreza Armandpour, Fatih Ilhan, David Harrison, Ajay Jaiswal, Duc N. M. Hoang, and 4 others
TLDR
This paper introduces a training-free diagnostic framework for analyzing on-policy distillation, showing that the distillation signal aligns better with the ideal update on incorrect rollouts than on correct ones, and that the optimal distillation context varies with student capacity and task.
Key contributions
- Introduces a training-free diagnostic framework for high-resolution analysis of on-policy distillation.
- Derives an ideal per-node gradient to quantify parameter updates for maximal student success.
- Develops a scalable targeted-rollout algorithm to efficiently estimate this ideal gradient.
- Shows distillation helps more on incorrect rollouts; optimal context varies by student capacity and task.
Why it matters
Understanding when and why on-policy distillation works is crucial for training effective reasoning models. This framework provides fine-grained insights, enabling more targeted and efficient application of distillation techniques by moving beyond costly aggregate metrics to token-level dynamics.
Original Abstract
On-policy distillation offers dense, per-token supervision for training reasoning models; however, it remains unclear under which conditions this signal is beneficial and under which it is detrimental. Which teacher model should be used, and in the case of self-distillation, which specific context should serve as the supervisory signal? Does the optimal choice vary from one token to the next? At present, addressing these questions typically requires costly training runs whose aggregate performance metrics obscure the dynamics at the level of individual tokens. We introduce a training-free diagnostic framework that operates at the highest resolution: per token, per question, and per teacher. We derive an ideal per-node gradient defined as the parameter update that maximally increases the student's probability of success. We then develop a scalable targeted-rollout algorithm to estimate this gradient efficiently, even for long chains of intermediate thoughts. The gradient alignment score, defined as the cosine similarity between this ideal gradient and any given distillation gradient, quantifies the extent to which a particular configuration approximates the ideal signal. Across a range of self-distillation settings and external teacher models, we observe that distillation guidance exhibits substantially higher alignment with the ideal on incorrect rollouts than on correct ones, where the student already performs well and the teacher's signal tends to become noisy. Furthermore, we find that the optimal distillation context depends jointly on the student model's capacity and the target task, and that no single universally effective configuration emerges. These findings motivate the use of per-task, per-token diagnostic analyses for distillation.
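The central quantity in the abstract, the gradient alignment score, is the cosine similarity between the estimated ideal per-node gradient and a candidate distillation gradient in parameter space. A minimal sketch of that comparison (the function name is hypothetical; the paper's targeted-rollout estimation of the ideal gradient is not reproduced here):

```python
import numpy as np

def gradient_alignment(ideal_grad, distill_grad, eps=1e-12):
    """Cosine similarity between an (estimated) ideal gradient and a
    distillation gradient, both flattened to 1-D parameter vectors.

    Returns a value in [-1, 1]; higher means the distillation signal
    better approximates the ideal update direction.
    """
    a = np.ravel(np.asarray(ideal_grad, dtype=np.float64))
    b = np.ravel(np.asarray(distill_grad, dtype=np.float64))
    denom = np.linalg.norm(a) * np.linalg.norm(b) + eps  # guard zero grads
    return float(a @ b / denom)
```

In this framing, the paper's observation that distillation "helps more on incorrect rollouts" corresponds to the score being systematically higher when computed on tokens from incorrect student rollouts than on tokens from correct ones.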