ArXiv TLDR

TIP: Token Importance in On-Policy Distillation

arXiv:2604.14084

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang + 1 more

cs.LG, cs.AI

TLDR

TIP organizes token importance in on-policy distillation along two axes, student entropy and teacher-student divergence, yielding token selection rules that match or exceed full-token training at a fraction of the tokens and peak memory.

Key contributions

  • Identifies two regions of informative tokens in OPD: positions with high student entropy, and low-entropy positions with high teacher-student divergence, where the student is overconfident and wrong.
  • Shows that entropy-based sampling of 50% of tokens matches or exceeds all-token training while reducing peak memory by up to 47%.
  • Demonstrates that fewer than 10% of tokens, drawn from the low-entropy, high-divergence region, nearly match full-token baselines: overconfident tokens carry a dense corrective signal despite being nearly invisible to entropy-only rules.
  • Introduces TIP, a two-axis taxonomy over student entropy and teacher-student divergence that motivates type-aware selection rules combining uncertainty and disagreement (see the sketch after this list).
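
The taxonomy is straightforward to express in code. Below is a minimal sketch assuming PyTorch-style per-position logits; the function name `tip_regions`, the thresholds, and the forward-KL choice of divergence are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def tip_regions(student_logits: torch.Tensor,
                teacher_logits: torch.Tensor,
                entropy_thresh: float = 1.0,
                div_thresh: float = 0.5) -> dict:
    """Bucket token positions along TIP's two axes (hypothetical sketch).

    Inputs: per-position logits of shape [batch, seq_len, vocab].
    Returns boolean masks of shape [batch, seq_len], one per region.
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)

    # Axis 1: student entropy, H(p_s) = -sum_v p_s(v) log p_s(v).
    entropy = -(s_logp.exp() * s_logp).sum(dim=-1)

    # Axis 2: teacher-student divergence; forward KL(teacher || student)
    # is one natural choice, assumed here for illustration.
    divergence = (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1)

    high_ent = entropy > entropy_thresh
    high_div = divergence > div_thresh
    return {
        "high_entropy": high_ent,                   # uncertain: first-order proxy region
        "overconfident": ~high_ent & high_div,      # low entropy, high divergence: dense corrective signal
        "confident_aligned": ~high_ent & ~high_div, # low entropy, low divergence: little to learn
    }
```

Type-aware selection rules then amount to unions of these masks, e.g. training only on `high_entropy | overconfident` positions.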

Why it matters

This paper significantly improves on-policy knowledge distillation by identifying and leveraging the most informative tokens. It enables more efficient training of student models, drastically reducing memory usage and computational costs, especially for larger models. This makes advanced distillation techniques more accessible and scalable.

Original Abstract

On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher--student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining $50\%$ of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to $47\%$. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than $10\%$ of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher--student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher--student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on $<20\%$ of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.
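
To make the headline efficiency result concrete, here is a minimal sketch, not the authors' implementation, of the 50% entropy-based sampling the abstract describes: score positions by student entropy, keep the top fraction per sequence, and compute the token-level distillation loss only there. The `keep_frac` parameter and the reverse-KL objective are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def entropy_sampled_opd_loss(student_logits: torch.Tensor,
                             teacher_logits: torch.Tensor,
                             keep_frac: float = 0.5) -> torch.Tensor:
    """OPD loss restricted to the top keep_frac of positions by student entropy."""
    s_logp = F.log_softmax(student_logits, dim=-1)  # [B, T, V]
    t_logp = F.log_softmax(teacher_logits, dim=-1)

    # Selection score: per-position student entropy.
    entropy = -(s_logp.exp() * s_logp).sum(dim=-1)  # [B, T]

    # Keep the highest-entropy keep_frac of positions in each sequence.
    k = max(1, int(keep_frac * entropy.size(1)))
    _, idx = entropy.topk(k, dim=1)                 # [B, k]

    # Gather only the selected positions. In a streaming implementation the
    # teacher logits and loss terms for dropped positions are never
    # materialized, which is where the peak-memory saving comes from.
    b_idx = torch.arange(entropy.size(0), device=idx.device).unsqueeze(1)
    s_sel, t_sel = s_logp[b_idx, idx], t_logp[b_idx, idx]  # [B, k, V]

    # Token-level reverse KL(student || teacher), a common OPD objective.
    return (s_sel.exp() * (s_sel - t_sel)).sum(dim=-1).mean()
```

Dropping the loss terms for unselected tokens this way is what lets the linked repository target memory-efficient distillation of larger models under limited GPU budgets.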
