ArXiv TLDR

Near-Future Policy Optimization

2604.20733

Chuanyu Qin, Chenxu Yang, Qingyi Si, Naibin Gu, Dingyu Yao + 4 more

cs.LG

TLDR

NPO and AutoNPO enhance Reinforcement Learning with Verifiable Rewards (RLVR) by sourcing off-policy trajectories from near-future checkpoints of the same training run, which are both stronger than the current policy and closer than any external source.

Key contributions

  • Proposes Near-Future Policy Optimization (NPO) using later checkpoints from the same training run for off-policy learning.
  • Balances trajectory quality (higher Q) against variance cost (lower V) to maximize the effective learning signal S = Q/V (see the sketch after this list).
  • Introduces AutoNPO, an adaptive variant that automatically triggers interventions and selects optimal guide checkpoints.
  • On Qwen3-VL-8B-Instruct with GRPO, improves average performance from 57.88 to 62.84 (NPO) and 63.15 (AutoNPO) while accelerating convergence.
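
The selection criterion in the second bullet can be made concrete with a short sketch. This is a minimal illustration, not the authors' implementation: the `quality` and `variance_cost` fields stand in for hypothetical estimators of Q (e.g., a checkpoint's mean verifiable reward) and V (e.g., a divergence proxy between the guide and the current policy), and the rule simply picks the candidate maximizing S = Q/V.

```python
from dataclasses import dataclass

@dataclass
class Checkpoint:
    step: int
    quality: float        # Q: e.g., mean verifiable reward of this checkpoint's rollouts
    variance_cost: float  # V: e.g., a divergence proxy from the current policy

def select_guide(candidates: list[Checkpoint], eps: float = 1e-8) -> Checkpoint:
    """Pick the candidate maximizing the effective learning signal S = Q / V."""
    return max(candidates, key=lambda c: c.quality / (c.variance_cost + eps))

# Toy example: the later checkpoint is stronger (higher Q) but also farther
# from the current policy (higher V); S = Q/V trades the two off.
candidates = [
    Checkpoint(step=1000, quality=0.52, variance_cost=0.10),  # S ~= 5.2
    Checkpoint(step=2000, quality=0.60, variance_cost=0.25),  # S ~= 2.4
]
print(select_guide(candidates).step)  # -> 1000
```

The toy numbers show why "strong enough" alone is not the criterion: the step-2000 checkpoint has higher Q, but its larger distributional gap (V) makes the nearer checkpoint the better guide under S = Q/V.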

Why it matters

This paper tackles the key challenge in Reinforcement Learning with Verifiable Rewards (RLVR): sourcing off-policy trajectories that are both strong and close to the current policy, where external teachers are strong but distributionally far and replayed past trajectories are close but capped in quality. By drawing auxiliary trajectories from the policy's own near-future checkpoints, NPO and AutoNPO accelerate convergence and raise the final performance ceiling.

Original Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a core post-training recipe. Introducing suitable off-policy trajectories into on-policy exploration accelerates RLVR convergence and raises the performance ceiling, yet finding a source of such trajectories remains the key challenge. Existing mixed-policy methods either import trajectories from external teachers (high-quality but distributionally far) or replay past training trajectories (close but capped in quality), and neither simultaneously satisfies the strong enough (higher $Q$, more new knowledge to learn) and close enough (lower $V$, more readily absorbed) conditions required to maximize the effective learning signal $\mathcal{S} = Q/V$. We propose \textbf{N}ear-Future \textbf{P}olicy \textbf{O}ptimization (\textbf{NPO}), a simple mixed-policy scheme that learns from a policy's own near-future self: a later checkpoint from the same training run is a natural source of auxiliary trajectories that is both stronger than the current policy and closer than any external source, directly balancing trajectory quality against variance cost. We validate NPO through two manual interventions, early-stage bootstrapping and late-stage plateau breakthrough, and further propose \textbf{AutoNPO}, an adaptive variant that automatically triggers interventions from online training signals and selects the guide checkpoint that maximizes $\mathcal{S}$. On Qwen3-VL-8B-Instruct with GRPO, NPO improves average performance from 57.88 to 62.84, and AutoNPO pushes it to 63.15, raising the final performance ceiling while accelerating convergence.
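
To make "mixed-policy" concrete, here is a minimal sketch of how a near-future guide checkpoint could contribute trajectories to a GRPO-style rollout group. Everything here is an assumption for illustration: `mixed_policy_group`, the `guide_frac` mixing ratio, and the stub policies are hypothetical, and the paper's actual intervention triggers and any off-policy correction terms are not shown. The group-relative advantage normalization in `grpo_advantages` follows standard GRPO.

```python
import random
from typing import Callable

def mixed_policy_group(prompt: str,
                       current_policy: Callable[[str], str],
                       guide_policy: Callable[[str], str],
                       group_size: int = 8,
                       guide_frac: float = 0.25) -> list[str]:
    """Assemble one GRPO rollout group mixing on-policy samples with
    off-policy samples from a near-future guide checkpoint.

    `guide_frac` (hypothetical) controls how many group slots the guide fills.
    """
    n_guide = int(group_size * guide_frac)
    group = [guide_policy(prompt) for _ in range(n_guide)]                  # off-policy
    group += [current_policy(prompt) for _ in range(group_size - n_guide)]  # on-policy
    random.shuffle(group)
    return group

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Standard GRPO step: normalize verifiable rewards within the group so
    each trajectory's advantage is measured relative to its groupmates."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Toy usage with stub policies returning canned answers:
group = mixed_policy_group("2+2=?", lambda p: "4", lambda p: "four")
print(grpo_advantages([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]))
```

Because GRPO scores each trajectory against its group mean, stronger guide rollouts raise the bar inside the group, which is how the near-future checkpoint injects learning signal without an external teacher.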
