ArXiv TLDR

Advantage-Guided Diffusion for Model-Based Reinforcement Learning

arXiv: 2604.09035

Daniele Foffano, Arvid Eriksson, David Broman, Karl H. Johansson, Alexandre Proutiere

cs.AI · cs.LG

TLDR

Advantage-Guided Diffusion (AGD-MBRL) steers diffusion world models using advantage estimates to improve long-term returns in model-based RL.

Key contributions

  • Introduces Advantage-Guided Diffusion (AGD-MBRL) to mitigate short-horizon myopia in diffusion-model MBRL.
  • Develops two guidance methods: Sigmoid Advantage Guidance (SAG) and Exponential Advantage Guidance (EAG); see the sketch after this list.
  • Proves that SAG- and EAG-guided sampling reweights trajectories with weights increasing in the state-action advantage, implying policy improvement under standard assumptions.
  • Achieves up to 2x better sample efficiency and final return on MuJoCo tasks over PolyGRAD, a Diffuser-style reward guide, and model-free baselines (PPO/TRPO).
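
This digest does not reproduce the exact guidance functions, but the names suggest a bounded sigmoid reweighting and an unbounded exponential one. A minimal sketch under that assumption; the function names and the `temperature` and `scale` parameters below are illustrative, not the paper's definitions:

```python
import numpy as np

# Hypothetical weight functions suggested by the method names; the paper's
# exact forms (temperatures, clipping, normalization) are not given here.
def sag_weight(advantage: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Sigmoid Advantage Guidance: bounded weight in (0, 1), increasing in A(s, a)."""
    return 1.0 / (1.0 + np.exp(-advantage / temperature))

def eag_weight(advantage: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Exponential Advantage Guidance: unbounded weight, increasing in A(s, a)."""
    return np.exp(scale * advantage)

# Both weights increase monotonically in the advantage, so reweighted sampling
# concentrates probability mass on higher-advantage trajectories.
adv = np.linspace(-2.0, 2.0, 5)
print(sag_weight(adv))  # saturates for large |A|, so the guidance signal stays bounded
print(eag_weight(adv))  # grows without bound for large positive A
```

Either weight is monotonically increasing in the advantage, which is the property the policy-improvement argument relies on; the sigmoid form additionally keeps the guidance signal bounded.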

Why it matters

Existing guides for diffusion world models in MBRL are either policy-only, discarding value information, or reward-based, which becomes myopic when the generated horizon is short. AGD-MBRL addresses this by steering trajectory generation with advantage estimates, so that sampling concentrates on trajectories expected to yield higher long-term return beyond the generated window. This improves sample efficiency and final performance in model-based RL, making diffusion world models more effective.

Original Abstract

Model-based reinforcement learning (MBRL) with autoregressive world models suffers from compounding errors, whereas diffusion world models mitigate this by generating trajectory segments jointly. However, existing diffusion guides are either policy-only, discarding value information, or reward-based, which becomes myopic when the diffusion horizon is short. We introduce Advantage-Guided Diffusion for MBRL (AGD-MBRL), which steers the reverse diffusion process using the agent's advantage estimates so that sampling concentrates on trajectories expected to yield higher long-term return beyond the generated window. We develop two guides: (i) Sigmoid Advantage Guidance (SAG) and (ii) Exponential Advantage Guidance (EAG). We prove that a diffusion model guided through SAG or EAG allows us to perform reweighted sampling of trajectories with weights increasing in state-action advantage, implying policy improvement under standard assumptions. Additionally, we show that the trajectories generated from AGD-MBRL follow an improved policy (that is, with higher value) compared to an unguided diffusion model. AGD integrates seamlessly with PolyGRAD-style architectures by guiding the state components while leaving action generation policy-conditioned, and requires no change to the diffusion training objective. On MuJoCo control tasks (HalfCheetah, Hopper, Walker2D and Reacher), AGD-MBRL improves sample efficiency and final return over PolyGRAD, an online Diffuser-style reward guide, and model-free baselines (PPO/TRPO), in some cases by a margin of 2x. These results show that advantage-aware guidance is a simple, effective remedy for short-horizon myopia in diffusion-model MBRL.
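
To make the mechanism concrete, below is a minimal sketch of one advantage-guided reverse-diffusion step in the PolyGRAD-style setup the abstract describes: a classifier-guidance-style gradient is applied to the state components only, while actions are re-sampled from the current policy. All callable names, the update rule, and the log-sigmoid weight are illustrative assumptions, not the paper's code:

```python
import torch

def advantage_guided_step(denoise_mean, sigma_t, advantage_fn, policy_sample,
                          states_t, actions_t, t, guidance_scale=1.0):
    """One reverse-diffusion step with advantage guidance on the states only.

    Placeholder interfaces (assumed, not from the paper):
      denoise_mean(states, actions, t) -> posterior mean of the states at step t-1
      sigma_t                          -> noise scale at step t (float)
      advantage_fn(states, actions)    -> differentiable advantage estimates A(s, a)
      policy_sample(states)            -> actions drawn from the current policy
    """
    states = states_t.detach().requires_grad_(True)

    # Classifier-guidance-style term: gradient of the log guidance weight with
    # respect to the noisy states. With SAG the weight is sigmoid(A), so the
    # log-weight is log-sigmoid(A); with EAG it would be linear in A.
    log_w = torch.nn.functional.logsigmoid(advantage_fn(states, actions_t)).sum()
    grad = torch.autograd.grad(log_w, states)[0]

    # Shift the denoised mean toward higher-advantage states, then add noise.
    mean = denoise_mean(states, actions_t, t) + guidance_scale * sigma_t**2 * grad
    noise = torch.randn_like(mean) if t > 0 else torch.zeros_like(mean)
    next_states = (mean + sigma_t * noise).detach()

    # Leave action generation policy-conditioned: re-sample actions from the
    # current policy given the guided states, with no guidance applied to them.
    next_actions = policy_sample(next_states)
    return next_states, next_actions
```

Note that the diffusion training objective is untouched: guidance only modifies the sampling-time update, which is what lets AGD drop into an existing PolyGRAD-style model.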
