ArXiv TLDR

Natural Policy Gradient as Doubly Smoothed Policy Iteration: A Bellman-Operator Framework

arXiv: 2605.10671

Phalguni Nanda, Zaiwei Chen

cs.LG, math.OC, stat.ML

TLDR

This paper shows that Natural Policy Gradient is exactly a doubly smoothed form of policy iteration, and uses this view to prove global geometric convergence and an explicit iteration complexity bound for computing ε-optimal policies.

Key contributions

  • Formulates Natural Policy Gradient as Doubly Smoothed Policy Iteration (DSPI), a unified Bellman-operator framework (a minimal sketch of the update follows this list).
  • Proves global geometric convergence for DSPI using smoothed Bellman operators, without extra regularization.
  • Establishes NPG's iteration complexity of 𝒪((1-γ)⁻¹ log((1-γ)⁻¹ε⁻¹)) for computing an ε-optimal policy.
  • Shows finite termination for dual-averaged policy iteration and extends the framework to linear function approximation and stochastic shortest path problems.
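
To make the DSPI update concrete, below is a minimal NumPy sketch on a small random tabular MDP. The MDP, the stepsize eta, and the use of a softmax as the regularized greedy step are illustrative assumptions, not the paper's setup; with a uniform initial policy and a constant stepsize, applying a softmax greedy step to the running sum of past Q-functions is the standard dual-averaging form of natural policy gradient, which the paper treats as a special case of DSPI.

```python
# Illustrative sketch of the DSPI-style update (softmax greedy step applied to an
# accumulation of past Q-functions) on a toy random MDP. Not the paper's experiments.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, eta = 5, 3, 0.9, 1.0           # small random MDP; sizes and stepsize are arbitrary
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a] is a distribution over next states
R = rng.uniform(size=(S, A))                # rewards r(s, a)

def q_of(pi):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = r_pi, then back out Q."""
    P_pi = np.einsum("sap,sa->sp", P, pi)   # state-to-state kernel under pi
    r_pi = np.einsum("sa,sa->s", R, pi)     # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return R + gamma * P @ V                # Q[s, a] = r(s, a) + gamma * E[V(s')]

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)    # numerically stable row-wise softmax
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

pi = np.full((S, A), 1.0 / A)   # uniform initial policy
Q_sum = np.zeros((S, A))        # running sum of past Q-functions (the "averaged" part)
for k in range(200):
    Q_sum += q_of(pi)           # accumulate Q^{pi_k}
    pi = softmax(eta * Q_sum)   # regularized (softmax) greedy step (the "smoothed" part)

print("greedy actions of final policy:", q_of(pi).argmax(axis=1))
```

Replacing the softmax with a hard argmax over the accumulated Q-values corresponds to the unregularized greedy case, i.e. dual-averaged policy iteration.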

Why it matters

This work deepens the theoretical understanding of Natural Policy Gradient by unifying it with policy iteration under a single Bellman-operator framework. The resulting global geometric convergence and iteration complexity bounds hold without modifying the MDP, adding extra regularization, or using adaptive stepsizes, which offers new insight into why core RL algorithms are efficient.

Original Abstract

In this work, we show that natural policy gradient, a core algorithm in reinforcement learning, admits an exact formulation as a smoothed and averaged form of policy iteration. Specifically, we introduce doubly smoothed policy iteration (DSPI), a Bellman-operator framework in which each policy is obtained by applying a regularized greedy step to a weighted average of past $Q$-functions. DSPI includes policy iteration, dual-averaged policy iteration, natural policy gradient, and more general policy dual averaging methods as special cases. Using only monotonicity and contraction of smoothed Bellman operators, we prove distribution-free global geometric convergence of DSPI. Consequently, standard natural policy gradient and policy dual averaging achieve an iteration complexity of $\mathcal{O}((1-γ)^{-1}\log((1-γ)^{-1}ε^{-1}))$ for computing an $ε$-optimal policy, without modifying the MDP, adding regularization beyond the mirror map inherent in the update, or using adaptive, trajectory-dependent stepsizes. For the unregularized greedy case, corresponding to dual-averaged policy iteration, we also prove finite termination. The same Bellman-operator framework further extends to discounted MDPs with linear function approximation and stochastic shortest path problems.
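As a rough sense of scale (this back-of-the-envelope calculation is mine, treating the hidden constant as 1 and log as the natural logarithm): with $γ = 0.99$ and $ε = 0.01$, the bound evaluates to $(1-γ)^{-1}\log((1-γ)^{-1}ε^{-1}) = 100 \cdot \log(10^{4}) \approx 921$ iterations, and the stated iteration count has no dependence on the number of states or actions.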

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.