
OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

arXiv:2605.12400

Yuxiao Yang, Xiaoyun Wang, Weitong Zhang

cs.LG, cs.AI

TLDR

OGLS-SD enhances LLM reasoning by using outcome-guided logit steering to correct teacher-student mismatches in on-policy self-distillation.

Key contributions

  • Identifies a teacher-student mismatch in on-policy self-distillation (OPSD) caused by reflection-induced bias and response templates.
  • Proposes OGLS-SD, an outcome-guided logit steering framework for LLM reasoning.
  • Calibrates teacher logits using verifiable outcome rewards from successful and failed trajectories.
  • Stabilizes self-distillation and improves reasoning performance over standard OPSD.

Why it matters

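On-policy self-distillation improves LLM reasoning, but self-reflected teacher responses can be shifted by reflection-induced bias and response templates, which miscalibrates the token-level supervision the student receives. OGLS-SD calibrates the teacher logits with verifiable outcome rewards. The authors report that this stabilizes self-distillation and improves reasoning performance over standard OPSD and other variants across diverse benchmarks.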

Original Abstract

We study on-policy self-distillation (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite the performance gains of OPSD, we identify a common but often overlooked mismatch between teacher and student responses: self-reflected teacher responses can be shifted by reflection-induced bias and response templates, leading to miscalibrated token-level supervision. To mitigate this issue, we propose OGLS-SD, an outcome-guided logit-steering framework that leverages verifiable outcome rewards to contrast successful and failed on-policy trajectories and calibrate teacher logits. By combining outcome-level correctness with dense token-level guidance through logit steering, OGLS-SD stabilizes self-distillation and improves reasoning performance over standard OPSD and other variants across diverse benchmarks.
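
Illustrative sketch (not from the paper)

The abstract does not give the steering formula, so the following is only a minimal sketch of what "contrasting successful and failed trajectories to calibrate teacher logits" could look like at a single token position. The steering rule (adding a scaled difference of mean logits), the function names, the `alpha` parameter, and the toy numbers are all assumptions made for illustration. The actual OGLS-SD update may differ.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def steer_teacher_logits(teacher_logits, success_logits, failure_logits, alpha=1.0):
    """Hypothetical steering rule (an assumption, NOT the paper's formula):
    shift the teacher logits along the direction that separates rollouts
    with a correct verified outcome from rollouts with an incorrect one."""
    direction = success_logits.mean(axis=0) - failure_logits.mean(axis=0)
    return teacher_logits + alpha * direction

def distill_kl(teacher_logits, student_logits):
    """KL(teacher || student) at one token position."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return float(np.sum(np.exp(log_p) * (log_p - log_q)))

# Toy data: a vocabulary of 4 tokens at a single position.
teacher = np.array([2.0, 1.0, 0.0, 0.0])   # self-reflected teacher logits
student = np.array([1.0, 1.0, 0.5, 0.0])   # on-policy student logits
success = np.array([[0.0, 2.0, 0.0, 0.0],  # logits from verified-correct rollouts
                    [0.0, 1.0, 0.0, 0.0]])
failure = np.array([[2.0, 0.0, 0.0, 0.0],  # logits from verified-incorrect rollouts
                    [1.0, 0.0, 0.0, 0.0]])

steered = steer_teacher_logits(teacher, success, failure, alpha=1.0)
loss = distill_kl(steered, student)  # token-level distillation signal
```

In this toy setup the unsteered teacher prefers token 0, while the steered teacher prefers token 1, the token favored by the successful rollouts. Setting `alpha=0.0` leaves the teacher logits unchanged, so the loss reduces to a plain KL distillation term.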
