ArXiv TLDR

Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models

arXiv:2604.08527

Feng Luo, Yu-Neng Chuang, Guanchu Wang, Zicheng Xu, Xiaotian Han + 2 more

cs.CL · cs.LG

TLDR

This paper identifies and solves length inflation in on-policy distillation (OPD) for LLMs, improving training stability and performance.

Key contributions

  • Identifies "truncation collapse," a failure mode of on-policy distillation (OPD) in which abrupt length inflation lets truncated rollouts dominate the training data.
  • Shows that OPD implicitly favors long, repetitive rollouts, causing unstable training and sharp performance drops (a minimal OPD sketch follows this list).
  • Proposes StableOPD, which combines a reference-based divergence constraint with rollout mixture distillation.
  • StableOPD prevents truncation collapse, stabilizes training, and improves math-reasoning performance by 7.2% on average.
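
For concreteness, below is a minimal sketch of the on-policy distillation loss referenced above. It assumes the common token-level reverse-KL formulation (KL from student to teacher, evaluated on the student's own rollouts); the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def opd_loss(student_logits: torch.Tensor,
             teacher_logits: torch.Tensor,
             mask: torch.Tensor) -> torch.Tensor:
    """Token-level reverse-KL distillation on student-generated rollouts.

    student_logits, teacher_logits: (batch, seq_len, vocab)
    mask: (batch, seq_len) floats, 1.0 on tokens the student actually generated.
    """
    log_s = F.log_softmax(student_logits, dim=-1)
    log_t = F.log_softmax(teacher_logits, dim=-1)
    # KL(student || teacher) per token, under the student's distribution.
    kl = (log_s.exp() * (log_s - log_t)).sum(dim=-1)
    # Averaging over all generated tokens means longer rollouts contribute
    # more terms per batch; one route by which student-induced data
    # collection can skew the gradient signal.
    return (kl * mask).sum() / mask.sum().clamp(min=1)
```

Because the rollouts are sampled from the student itself, any drift toward longer, repetitive generations directly reshapes the training data, which is the feedback loop the paper attributes the instability to.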

Why it matters

On-policy distillation (OPD) is a key technique for training efficient LLMs, but the instability caused by length inflation is a major hurdle. This paper explains the root cause of the problem and offers StableOPD, a practical fix that makes distillation markedly more reliable and effective.

Original Abstract

On-policy distillation (OPD) trains student models under their own induced distribution while leveraging supervision from stronger teachers. We identify a failure mode of OPD: as training progresses, on-policy rollouts can undergo abrupt length inflation, causing truncated trajectories to dominate the training data. This truncation collapse coincides with abrupt repetition saturation and induces biased gradient signals, leading to severe training instability and sharp degradation in validation performance. We attribute this problem to the interaction between student-induced data collection and the distillation objective, which implicitly favors long and repetitive rollouts. To address this issue, we propose StableOPD, a stabilized OPD framework that combines a reference-based divergence constraint with rollout mixture distillation. These together mitigate repetition-induced length inflation and further stabilize OPD training. Across multiple math reasoning datasets, our approach prevents truncation collapse, stabilizes training dynamics, and improves performance by 7.2% on average.
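
The abstract names StableOPD's two ingredients but not their exact form. The sketch below is one plausible instantiation, under stated assumptions: the reference-based divergence constraint modeled as a KL penalty toward a frozen reference model added to the distillation term, and rollout mixture distillation modeled as drawing each trajectory from either the student or the teacher. `beta`, `mix_ratio`, and the reverse-KL choice are illustrative assumptions, not the paper's specification.

```python
import random
import torch
import torch.nn.functional as F

def stable_opd_loss(student_logits, teacher_logits, ref_logits, mask, beta=0.1):
    """Distillation KL plus a reference-KL penalty (hypothetical form).

    All logits: (batch, seq_len, vocab); mask: (batch, seq_len) floats.
    ref_logits come from a frozen reference model (e.g., the initial student).
    """
    log_s = F.log_softmax(student_logits, dim=-1)
    log_t = F.log_softmax(teacher_logits, dim=-1)
    log_r = F.log_softmax(ref_logits, dim=-1)
    p_s = log_s.exp()
    distill = (p_s * (log_s - log_t)).sum(-1)  # KL(student || teacher)
    anchor = (p_s * (log_s - log_r)).sum(-1)   # KL(student || reference)
    per_token = distill + beta * anchor        # beta weights the constraint
    return (per_token * mask).sum() / mask.sum().clamp(min=1)

def sample_rollout(prompt, student_generate, teacher_generate, mix_ratio=0.5):
    """Rollout mixture: draw each trajectory from the student or the teacher,
    so the training data is no longer purely student-induced."""
    if random.random() < mix_ratio:
        return teacher_generate(prompt)  # off-policy (teacher) rollout
    return student_generate(prompt)      # on-policy (student) rollout
```

On this reading, the reference penalty damps the repetition-driven drift while mixed-in teacher rollouts dilute truncated student trajectories in each batch; the paper's actual mixing and weighting schemes may be more involved.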
