ArXiv TLDR

From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space

arXiv:2604.14142

Yuqiao Tan, Minzheng Wang, Bo Liu, Zichen Liu, Tian Liang + 3 more

cs.LG · cs.AI · cs.CL

TLDR

PreRL and DSRL enhance LLM reasoning by applying reward-driven updates in the pre-train space, effectively pruning incorrect reasoning paths.

Key contributions

  • Introduces PreRL, applying reward-driven online updates directly to the marginal distribution P(y) in pre-train space.
  • Demonstrates strong gradient alignment between log P(y) and log P(y|x), validating PreRL as an RL surrogate.
  • Uncovers Negative Sample Reinforcement (NSR) within PreRL as a key driver for pruning incorrect reasoning (a minimal loss sketch follows this list).
  • Proposes Dual Space RL (DSRL), a "Policy Reincarnation" strategy combining NSR-PreRL and standard RL.
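
To make the first and third bullets concrete, here is a minimal PyTorch sketch of what a PreRL / NSR-PreRL objective could look like, inferred from the description above. The function name, the `nsr_only` switch, and the argument layout are illustrative assumptions, not the authors' implementation.

```python
import torch

def prerl_loss(logp_y: torch.Tensor, rewards: torch.Tensor,
               nsr_only: bool = True) -> torch.Tensor:
    """Illustrative PreRL / NSR-PreRL objective (a sketch, not the paper's code).

    logp_y  : per-sample sum of token log-probs of completion y, scored
              WITHOUT the prompt x, i.e. an estimate of log P(y).
    rewards : verifiable rewards in {0., 1.} (1 = correct final answer).
    """
    if nsr_only:
        # Negative Sample Reinforcement: keep only incorrect samples and
        # minimize their marginal log-likelihood, pushing probability mass
        # away from wrong reasoning paths in the pre-train space.
        neg = (rewards == 0).float()
        return (neg * logp_y).sum() / neg.sum().clamp(min=1)
    # Plain PreRL: REINFORCE-style reward-weighted update on log P(y).
    adv = rewards - rewards.mean()
    return -(adv * logp_y).mean()
```

Minimizing the NSR branch lowers log P(y) for failed completions, which is exactly the "pruning incorrect reasoning" behavior the bullets describe.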

Why it matters

This paper tackles LLM reasoning limitations by optimizing the marginal distribution P(y) in the pre-train space rather than only the conditional P(y|x). It introduces PreRL, which applies reward-driven online updates to P(y), and builds on it with DSRL, a policy-reincarnation strategy that uses Negative Sample Reinforcement to prune incorrect reasoning paths before standard RL takes over (a schedule sketch follows below).
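
For intuition, here is a hedged sketch of how the two DSRL phases could be wired into one training loop, reusing the NSR-PreRL loss from the sketch above. The `switch_step` hyperparameter and the phase objectives are assumptions for illustration, not the paper's published recipe.

```python
import torch

def dsrl_loss(step: int, logp_y: torch.Tensor, logp_y_given_x: torch.Tensor,
              rewards: torch.Tensor, switch_step: int = 200) -> torch.Tensor:
    """Dual Space RL schedule (sketch): NSR-PreRL warm start, then standard RL.

    switch_step is a hypothetical hyperparameter marking the "Policy
    Reincarnation" handoff from pre-train-space pruning to conditional-space
    optimization.
    """
    if step < switch_step:
        # Phase 1 (NSR-PreRL): suppress P(y) on incorrect samples to prune
        # the wrong reasoning subspace before fine-grained RL begins.
        neg = (rewards == 0).float()
        return (neg * logp_y).sum() / neg.sum().clamp(min=1)
    # Phase 2 (standard RLVR): reward-weighted update on log P(y|x).
    adv = rewards - rewards.mean()
    return -(adv * logp_y_given_x).mean()
```

The ordering mirrors the paper's framing: coarse pruning in pre-train space first to expand the reasoning horizon, then fine-grained conditional optimization.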

Original Abstract

While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.
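
The abstract's surrogate claim, strong gradient alignment between log P(y) and log P(y|x), can be probed with a few lines of PyTorch. The sketch below assumes a Hugging Face-style causal LM whose forward call returns `.logits`; `seq_logprob` and `grad_alignment` are hypothetical helper names, and `y_ids` is assumed to begin with a BOS token so both scores cover the same completion tokens.

```python
import torch
import torch.nn.functional as F

def seq_logprob(model, ids: torch.Tensor, start: int) -> torch.Tensor:
    """Sum of log-probs of ids[start:] under a causal LM (HF-style API assumed)."""
    logits = model(ids.unsqueeze(0)).logits[0]   # [T, V]
    logps = F.log_softmax(logits[:-1], dim=-1)   # position t predicts token t+1
    tok_lp = logps.gather(-1, ids[1:].unsqueeze(-1)).squeeze(-1)
    return tok_lp[start - 1:].sum()

def grad_alignment(model, x_ids: torch.Tensor, y_ids: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between grad log P(y) and grad log P(y|x) (toy probe)."""
    params = [p for p in model.parameters() if p.requires_grad]
    flat = lambda grads: torch.cat([g.flatten() for g in grads])

    # grad log P(y): score the completion alone, no prompt (pre-train space).
    g_marginal = torch.autograd.grad(seq_logprob(model, y_ids, 1), params)

    # grad log P(y|x): score the same tokens y_ids[1:] conditioned on the prompt.
    g_conditional = torch.autograd.grad(
        seq_logprob(model, torch.cat([x_ids, y_ids]), len(x_ids) + 1), params)

    return F.cosine_similarity(flat(g_marginal), flat(g_conditional), dim=0)
```

A cosine near 1 on sampled completions would be consistent with the paper's claim that PreRL updates move the policy in nearly the same direction as standard RL.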
