Bounded Ratio Reinforcement Learning
Yunke Ao, Le Chen, Bruce D. Lee, Assefa S. Wahd, Aline Czarnobai, et al.
TLDR
This paper introduces the Bounded Ratio Reinforcement Learning (BRRL) framework and the Bounded Policy Optimization (BPO) algorithm, which carries a theoretical monotonic-improvement guarantee and empirically matches or outperforms PPO.
Key contributions
- Introduces the Bounded Ratio RL (BRRL) framework, a regularized and constrained policy optimization problem with an analytical optimal solution.
- Develops the Bounded Policy Optimization (BPO) algorithm, which guarantees monotonic performance improvement (sketched below, after this list).
- Offers a new theoretical lens on PPO's success, connecting it to trust region methods and the Cross-Entropy Method (CEM).
- Empirically, BPO and its LLM fine-tuning extension (GBPO) generally match or outperform PPO and GRPO in stability and final performance.
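The digest does not spell out BRRL's closed-form solution, but regularized problems of this kind typically yield an exponentiated-advantage tilt of the old policy with the probability ratio kept inside a bound. The sketch below is a minimal, hypothetical rendering of that idea and of BPO's advantage-weighted projection onto it; `temperature`, `eps`, the clamp-based bound, and the squared-ratio divergence are all assumptions, not the paper's exact formulation.

```python
import torch

def brrl_target_ratio(advantages, temperature=1.0, eps=0.2):
    """Hypothetical BRRL target: tilt the old policy by exponentiated
    advantages, then bound the probability ratio to [1 - eps, 1 + eps].
    The paper's actual analytic solution may take a different form."""
    ratio = torch.exp(advantages / temperature)        # unbounded pi* / pi_old
    return torch.clamp(ratio, 1.0 - eps, 1.0 + eps)    # enforce the ratio bound

def bpo_loss(logp_new, logp_old, advantages, temperature=1.0, eps=0.2):
    """Sketch of BPO: an advantage-weighted divergence pulling the current
    policy toward the bounded target. The |A|-weighting and squared-ratio
    divergence are placeholders for the paper's exact objective."""
    with torch.no_grad():
        target = brrl_target_ratio(advantages, temperature, eps)
        weights = advantages.abs()                     # assumed weighting scheme
    ratio = torch.exp(logp_new - logp_old)             # pi_theta / pi_old per sample
    return (weights * (ratio - target) ** 2).mean()
```

On this reading, bounding the target ratio plays the role of PPO's clip, but it defines the target of a projection rather than gating the gradient, which is one way to interpret the paper's claimed theoretical lens on the PPO loss.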
Why it matters
PPO is the workhorse of on-policy RL, but its clipped objective is a heuristic that departs from the trust region theory that motivated it. This paper closes that gap with BRRL and BPO, a theoretically grounded approach with guaranteed monotonic improvement. It also offers new insight into why PPO works and generally matches or outperforms PPO and GRPO empirically, making it a meaningful step toward more robust RL.
Original Abstract
Proximal Policy Optimization (PPO) has become the predominant algorithm for on-policy reinforcement learning due to its scalability and empirical robustness across domains. However, there is a significant disconnect between the underlying foundations of trust region methods and the heuristic clipped objective used in PPO. In this paper, we bridge this gap by introducing the Bounded Ratio Reinforcement Learning (BRRL) framework. We formulate a novel regularized and constrained policy optimization problem and derive its analytical optimal solution. We prove that this solution ensures monotonic performance improvement. To handle parameterized policy classes, we develop a policy optimization algorithm called Bounded Policy Optimization (BPO) that minimizes an advantage-weighted divergence between the policy and the analytic optimal solution from BRRL. We further establish a lower bound on the expected performance of the resulting policy in terms of the BPO loss function. Notably, our framework also provides a new theoretical lens to interpret the success of the PPO loss, and connects trust region policy optimization and the Cross-Entropy Method (CEM). We additionally extend BPO to Group-relative BPO (GBPO) for LLM fine-tuning. Empirical evaluations of BPO across MuJoCo, Atari, and complex IsaacLab environments (e.g., Humanoid locomotion), and of GBPO for LLM fine-tuning tasks, demonstrate that BPO and GBPO generally match or outperform PPO and GRPO in stability and final performance.
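The abstract names the LLM extension only as "Group-relative BPO"; a reasonable reading is that it reuses GRPO-style group-relative advantages, where each sampled response to a prompt is scored against the other responses in its group. Below is a minimal sketch of that normalization (the exact GBPO objective is not given in the digest):

```python
import torch

def group_relative_advantages(rewards, group_size):
    """GRPO-style group-relative advantages: normalize each sampled
    response's reward against the other responses for the same prompt.
    Assumes len(rewards) is a multiple of group_size."""
    r = rewards.view(-1, group_size)                   # [num_prompts, group_size]
    mean = r.mean(dim=1, keepdim=True)
    std = r.std(dim=1, keepdim=True)
    return ((r - mean) / (std + 1e-8)).view(-1)        # one advantage per response
```

These per-response advantages would then, presumably, stand in for the critic-based advantages in the BPO loss sketched above.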