How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum
TLDR
This paper introduces a Tsallis loss family ($J_Q$) that mitigates cold-start stalling in reasoning models by interpolating between an exploitation pole (RLVR) and a density-estimation pole (log-marginal-likelihood).
Key contributions
- Introduces a Tsallis $q$-logarithm loss family ($J_Q$) that interpolates between RLVR ($q=0$, the exploitation pole) and the log-marginal-likelihood over latent trajectories ($q=1$, the density-estimation pole); see the sketch after this list.
- Shows $J_Q$ addresses cold-start stalling: under gradient flow, the density-estimation pole ($q=1$) escapes cold start in $\Theta(\log(1/p_0))$ time, versus $\Omega(1/p_0)$ for the exploitation pole ($q=0$).
- Develops two Monte Carlo estimators, Gradient-Amplified RL (GARL) and Posterior-Attenuated Fine-Tuning (PAFT), for the intractable amplification factor $P_\theta^{-q}$.
- GARL at $q=0.75$ substantially mitigates cold-start stalling on FinQA, HotPotQA, and MuSiQue, escaping cold start where GRPO fails entirely.
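As a sketch of the interpolation, assuming the standard Tsallis $q$-logarithm (the paper's exact objective may normalize differently):

$$
\ln_q(x) = \frac{x^{1-q} - 1}{1 - q} \quad (q \neq 1), \qquad \ln_1(x) = \lim_{q \to 1} \ln_q(x) = \log x.
$$

Applied to the per-example success probability, $J_Q = \mathbb{E}\big[\ln_q P_\theta\big]$: at $q{=}0$ this is $P_\theta - 1$ (expected reward, the RLVR pole), and at $q{=}1$ it is $\log P_\theta$ (the density-estimation pole). Since $\nabla_\theta \ln_q P_\theta = P_\theta^{-q}\,\nabla_\theta P_\theta$, all members share the gradient direction $\nabla_\theta P_\theta$ and differ only by the scalar amplification $P_\theta^{-q}$.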
Why it matters
Current reinforcement learning methods for reasoning models often stall when initial success is low, limiting their adaptability. This paper introduces a novel loss family that explicitly addresses this "cold-start" problem. It provides practical algorithms (GARL, PAFT) that significantly improve training stability and performance, especially in challenging low-success scenarios.
Original Abstract
Adapting reasoning models to new tasks during post-training with only output-level supervision stalls under reinforcement learning from verifiable rewards (RLVR) when the initial success probability $p_0$ is small. Using the Tsallis $q$-logarithm, we define a loss family $J_Q$ that interpolates between RLVR (at $q{=}0$, the exploitation pole) and the log-marginal-likelihood over latent trajectories (at $q{=}1$, the density-estimation pole). All members share the same per-example gradient direction, differing only by a scalar amplification $P_\theta^{-q}$ that reweights each instance independently of the learning rate. This amplification is the mechanism that addresses cold-start stalling: under gradient flow, the exploitation pole requires $\Omega(\frac{1}{p_0})$ time to escape cold start, while the density-estimation pole escapes in $\Theta\big(\log(\frac{1}{p_0})\big)$; intermediate $q$ trades escape speed against noise memorization. Because $P_\theta$ is intractable, we derive two Monte Carlo estimators from the two factorizations of the gradient: Gradient-Amplified RL (GARL) samples from the prior and amplifies the RL gradient, and Posterior-Attenuated Fine-Tuning (PAFT) importance-resamples from the posterior and runs standard SFT. Both have bias $O\big(\frac{q}{M P_\theta^{q+1}}\big)$; GARL has lower variance, PAFT has semantically coherent gradients. On FinQA, HotPotQA, and MuSiQue, GARL at $q{=}0.75$ substantially mitigates cold-start stalling, escaping cold start where GRPO fails entirely. In warm start, GARL at low $q$ dominates FinQA where training is stable; on HotPotQA and MuSiQue, GARL destabilizes during training, and PAFT at $q{=}0.75$ provides stable gradients (best overall on HotPotQA at 47.9 maj@16, $+14.4$ over GRPO).
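To make the escape-time claim concrete, here is a minimal, self-contained sketch (not the paper's algorithm): a one-parameter Bernoulli toy where the exact $J_Q$ gradient is followed, so $q{=}0$ crawls out of cold start while larger $q$ escapes quickly. The learning rate, target, and $p_0$ are illustrative assumptions.

```python
# Toy illustration of the J_Q amplification mechanism on a one-parameter
# Bernoulli "task", where p_theta = sigmoid(theta) plays the role of the
# per-example success probability P_theta. We follow the exact gradient
# of ln_q(p_theta) (no sampling), isolating the escape-time behavior.
import numpy as np

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

def steps_to_escape(q: float, p0: float = 1e-3, target: float = 0.5,
                    lr: float = 0.1, max_steps: int = 1_000_000) -> int:
    """Gradient-ascent steps on ln_q(p_theta) until p_theta >= target.

    grad_theta ln_q(p) = p^{-q} * dp/dtheta with dp/dtheta = p * (1 - p),
    so q=0 recovers the RLVR gradient p(1-p) and q=1 gives (1-p).
    """
    theta = float(np.log(p0 / (1.0 - p0)))  # logit of the cold-start probability
    for step in range(max_steps):
        p = sigmoid(theta)
        if p >= target:
            return step
        theta += lr * p ** (-q) * p * (1.0 - p)  # amplified gradient step
    return max_steps

for q in (0.0, 0.5, 0.75, 1.0):
    n = steps_to_escape(q)
    print(f"q={q:.2f}: {n:>7,} steps to reach p=0.5 from p0=1e-3")
```

In the paper's estimators the exact $P_\theta^{-q}$ is replaced by a Monte Carlo estimate from $M$ rollouts (drawn from the prior in GARL, importance-resampled from the posterior in PAFT), which is where the $O\big(\frac{q}{M P_\theta^{q+1}}\big)$ bias quoted in the abstract comes from.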