Yuanda Xu
2 papers · Latest:
Machine Learning
Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
A new principle for LM post-training uses sparse rewards for strong teachers and dense distillation for students, outperforming direct sparse RL.
2605.12483
Machine Learning
TIP: Token Importance in On-Policy Distillation
TIP introduces a two-axis taxonomy for token importance in on-policy distillation, significantly improving efficiency and reducing memory usage.
2604.14084