Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation
Kaiyuan Liu, Ziyuan Zhuang, Yang Bai, Bing Wang, Rongxiang Weng, et al.
TLDR
A new on-policy distillation method, "Prefix Teach, Suffix Fade," improves strong-to-weak model training by focusing supervision on locally teachable trajectory segments.
Key contributions
- Identifies "local teachability collapse," where later segments of generated trajectories lack discriminative teacher feedback.
- Proposes a "trajectory-specific release rule" to truncate dense OPD supervision when local teachability collapses.
- The rule measures the teacher's margin over the student's top-K candidates, aggregates it across NLTK-tokenized sentence segments, and flags collapse at a BIC-style downward change point.
- Consistently outperforms standard full-trajectory OPD and better preserves out-of-domain capabilities on Qwen3 models.
Why it matters
Current on-policy distillation often assumes uniform utility of teacher feedback, which this paper challenges. By identifying and addressing "local teachability collapse," it enables more efficient and effective strong-to-weak model training. This leads to student models that perform better both in-domain and on out-of-domain tasks.
Original Abstract
On-policy distillation (OPD) trains a student model on its own rollouts using dense feedback from a stronger teacher. Prior literature suggests that, provided teacher feedback is available, supervising the full sequence of response tokens should monotonically improve performance. However, we demonstrate that this assumption sometimes fails to hold in strong-to-weak OPD settings. While later segments of a generated trajectory may still exhibit a non-zero teacher-student advantage, they frequently lack the local contrast that makes dense feedback effective for prioritizing student learning. We term this failure mode local teachability collapse. The resulting principle is straightforward: supervision should concentrate on trajectory regions where the teacher's feedback remains discriminative, rather than uniformly covering the entire response. We operationalize this principle through a trajectory-specific release rule. This rule measures the teacher's margin over the student's top-$K$ candidate set, aggregates this margin across NLTK-tokenized sentence segments, and truncates dense OPD supervision upon detecting a BIC-style downward change point. Experimental results across strong-to-weak distillation tasks using the Qwen3 model family indicate that this release rule consistently outperforms standard full-trajectory OPD across five in-domain benchmarks at various student scales. Furthermore, compared to baseline distillation methods, our approach better preserves model capabilities on out-of-domain tasks. These results suggest that effective strong-to-weak OPD requires evaluating not only the availability of teacher guidance but also its local utility, ensuring that the generated feedback remains teachable.
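The release rule is concrete enough to sketch. Below is a minimal, hypothetical Python rendering of the three steps the abstract names — a per-token teacher margin over the student's top-$K$ candidate set, aggregation over NLTK-tokenized sentences, and a BIC-style downward change-point test. Every function name, the exact margin definition, the token-to-sentence alignment, and the BIC penalty here are assumptions made for illustration; this is not the authors' code.

```python
"""Minimal sketch of the trajectory-specific release rule described in the
abstract. Illustrative only: the margin definition, alignment scheme, and
BIC penalty below are assumptions, not the paper's implementation."""
import numpy as np
from nltk.tokenize import sent_tokenize  # requires a one-time nltk.download("punkt")


def teacher_margin(teacher_logprobs, student_topk_ids):
    """Teacher's margin over the student's top-K candidates at one position.

    Assumed definition: the gap between the teacher's best and second-best
    log-probabilities restricted to the student's top-K token ids (K >= 2).
    A large gap means the teacher still discriminates among the candidates
    the student is actually weighing; a vanishing gap signals collapse.
    """
    restricted = np.sort(np.asarray(teacher_logprobs)[student_topk_ids])[::-1]
    return float(restricted[0] - restricted[1])


def sentence_margins(text, margins, token_char_offsets):
    """Average per-token margins within each NLTK sentence.

    `token_char_offsets[i]` is the character offset of token i in `text`,
    an assumed alignment between model tokens and NLTK sentences.
    """
    bounds, pos = [], 0
    for sent in sent_tokenize(text):
        pos = text.find(sent, pos) + len(sent)
        bounds.append(pos)
    buckets = [[] for _ in bounds]
    for m, c in zip(margins, token_char_offsets):
        j = next((k for k, b in enumerate(bounds) if c < b), len(bounds) - 1)
        buckets[j].append(m)
    return [float(np.mean(b)) for b in buckets if b]


def release_point(seg_margins, min_seg=2):
    """BIC-style single downward change-point over sentence-level margins.

    Compares a one-mean model against every two-segment split; returns the
    first segment index of the suffix to release (i.e., where dense OPD
    supervision is truncated), or None if no downward shift is supported.
    """
    x = np.asarray(seg_margins, dtype=float)
    n = len(x)
    if n < 2 * min_seg:
        return None

    def bic(rss, n_params):
        return n * np.log(rss / n + 1e-12) + n_params * np.log(n)

    best_t, best = None, bic(np.sum((x - x.mean()) ** 2), 1)
    for t in range(min_seg, n - min_seg + 1):
        head, tail = x[:t], x[t:]
        rss = np.sum((head - head.mean()) ** 2) + np.sum((tail - tail.mean()) ** 2)
        cand = bic(rss, 3)  # two segment means plus the change-point location
        if tail.mean() < head.mean() and cand < best:
            best_t, best = t, cand
    return best_t
```

Under this reading, the dense OPD loss would be applied only to tokens in the sentences before `release_point(seg_margins)`, with the remaining suffix "faded" out of supervision; anything beyond what the abstract states is a guess on our part.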