ArXiv TLDR

Beyond Distribution Sharpening: The Importance of Task Rewards

arXiv:2604.16259

Sarthak Mittal, Leo Gagnon, Guillaume Lajoie

cs.LG, cs.AI

TLDR

This paper shows that task-reward-based RL is crucial for model performance, outperforming distribution sharpening, which has inherent limitations.

Key contributions

  • Explicitly compares distribution sharpening with task-reward-based learning using RL.
  • Reveals fundamental limitations and instability of distribution sharpening from first principles.
  • Shows task-based rewards achieve robust performance improvements and stable learning on math datasets.

Why it matters

This research clarifies the debate over RL's role in training: task-reward-based learning, not mere distribution sharpening, is what yields robust, stable performance gains. By exposing the limitations of sharpening, it helps guide the future development of sophisticated AI agents.

Original Abstract

Frontier models have demonstrated exceptional capabilities following the integration of task-reward-based reinforcement learning (RL) into their training pipelines, enabling systems to evolve from pure reasoning models into sophisticated agents. However, debate persists regarding whether RL genuinely instills new skills within a base model or merely sharpens its existing distribution to elicit latent capabilities. To address this dichotomy, we present an explicit comparison between distribution sharpening and task-reward-based learning, utilizing RL as a tool to implement both paradigms. Our analysis reveals the inherent limitations of distribution sharpening, demonstrating from first principles how and why the optima can be unfavorable and the approach fundamentally unstable. Furthermore, our experiments using Llama-3.2-3B-Instruct, Qwen2.5-3B-Instruct and Qwen3-4B-Instruct-2507 on math datasets confirm that sharpening yields limited gains, whereas incorporating task-based reward signal can greatly help achieve robust performance improvements and stable learning.

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.