ArXiv TLDR

StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning

2604.18401

Daoyu Wang, Qingchuan Li, Mingyue Cheng, Jie Ouyang, Shuo Yu + 2 more

cs.CL

TLDR

StepPO introduces step-aligned policy optimization for Agentic RL, shifting from token-level to step-level MDPs to enhance LLM agent capabilities.

Key contributions

  • Proposes StepPO, a step-aligned policy optimization framework for Agentic Reinforcement Learning.
  • Advocates for a step-level Markov Decision Process (MDP) as the proper action representation for LLM agents.
  • Introduces step-level credit assignment to align policy optimization with agent decision granularity.
  • Outlines key system designs for practical step-level Agentic RL and shows initial experimental effectiveness.
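To make the step-level credit assignment idea concrete, here is a minimal, hypothetical sketch (not the paper's actual algorithm, and all names here are illustrative): tokens are grouped into steps (e.g., one tool call or one agent turn per step), a discounted return is computed per step, and the resulting step-level advantage is broadcast to every token in that step — rather than assigning credit token by token.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Step:
    """One agent decision (e.g., a tool call), holding its token ids and scalar reward."""
    tokens: List[int]
    reward: float

def step_level_advantages(steps: List[Step], gamma: float = 0.99) -> Tuple[List[float], List[float]]:
    # Discounted return per step, computed backwards over the trajectory.
    returns: List[float] = []
    g = 0.0
    for step in reversed(steps):
        g = step.reward + gamma * g
        returns.append(g)
    returns.reverse()

    # Simple mean baseline over steps; real methods would use a learned critic
    # or group-relative baseline.
    baseline = sum(returns) / len(returns)
    step_adv = [r - baseline for r in returns]

    # Broadcast each step's advantage to all tokens in that step, so policy
    # optimization operates at the granularity of agent decisions.
    token_adv: List[float] = []
    for step, adv in zip(steps, step_adv):
        token_adv.extend([adv] * len(step.tokens))
    return step_adv, token_adv
```

In this toy view, every token within a step shares the same advantage, which is the sense in which optimization is "aligned" with the step as the action unit; a token-level scheme would instead estimate credit independently per token.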

Why it matters

Agentic RL is vital for enhancing LLM agents in multi-turn interactive settings, but traditional token-level optimization is insufficient. StepPO provides a novel step-aligned paradigm to better capture LLM agent behavior, paving the way for more capable general agents.

Original Abstract

General agents have given rise to phenomenal applications such as OpenClaw and Claude Code. As these agent systems (a.k.a. Harnesses) strive for bolder goals, they demand increasingly stronger agentic capabilities from foundation Large Language Models (LLMs). Agentic Reinforcement Learning (RL) is emerging as a central post-training paradigm for empowering LLMs with these capabilities and is playing an increasingly pivotal role in agent training. Unlike single-turn token-level alignment or reasoning enhancement, as in RLHF and RLVR, Agentic RL targets multi-turn interactive settings, where the goal is to optimize core agentic capabilities such as decision making and tool use while addressing new challenges including delayed and sparse rewards, as well as long and variable context. As a result, the token-centric modeling and optimization paradigm inherited from traditional LLM RL is becoming increasingly inadequate for capturing real LLM agent behavior. In this paper, we present StepPO as a position on step-level Agentic RL. We argue that the conventional token-level Markov Decision Process (MDP) should be advanced to a step-level MDP formulation, and that the step, rather than the token, should be regarded as the proper action representation for LLM agents. We then propose step-level credit assignment as the natural optimization counterpart of this formulation, thereby aligning policy optimization and reward propagation with the granularity of agent decisions. Finally, we discuss the key system designs required to realize step-level Agentic RL in practice, and preliminary experiments provide initial evidence for the effectiveness of this perspective. We hope that the step-aligned, step-level paradigm embodied in StepPO offers the Agentic RL community a useful lens for understanding agent behavior and helps advance LLMs toward stronger general-agent capabilities.
