ArXiv TLDR

Advantage-Guided Diffusion for Model-Based Reinforcement Learning

arXiv: 2604.09035

Daniele Foffano, Arvid Eriksson, David Broman, Karl H. Johansson, Alexandre Proutiere

cs.AI · cs.LG

TLDR

Advantage-Guided Diffusion (AGD-MBRL) steers diffusion world models using advantage estimates to improve long-term returns in model-based RL.

Key contributions

  • Introduces Advantage-Guided Diffusion (AGD-MBRL) to mitigate short-horizon myopia in diffusion-model MBRL.
  • Develops two guidance methods: Sigmoid Advantage Guidance (SAG) and Exponential Advantage Guidance (EAG); see the sketch after this list.
  • Proves that SAG- and EAG-guided sampling reweights trajectories with weights increasing in the state-action advantage, implying policy improvement under standard assumptions.
  • Achieves up to 2x better sample efficiency and final return on MuJoCo tasks over PolyGRAD, a Diffuser-style reward guide, and model-free baselines (PPO/TRPO).
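
This digest does not reproduce the exact guidance functions, but the names suggest a bounded sigmoid reweighting and an unbounded exponential one. A minimal sketch under that assumption; the function names and the `temperature` and `scale` parameters below are illustrative, not the paper's definitions:

```python
import numpy as np

# Hypothetical weight functions suggested by the method names; the paper's
# exact forms (temperatures, clipping, normalization) are not given here.
def sag_weight(advantage: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Sigmoid Advantage Guidance: bounded weight in (0, 1), increasing in A(s, a)."""
    return 1.0 / (1.0 + np.exp(-advantage / temperature))

def eag_weight(advantage: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Exponential Advantage Guidance: unbounded weight, increasing in A(s, a)."""
    return np.exp(scale * advantage)

# Both weights increase monotonically in the advantage, so reweighted sampling
# concentrates probability mass on higher-advantage trajectories.
adv = np.linspace(-2.0, 2.0, 5)
print(sag_weight(adv))  # saturates for large |A|, so the guidance signal stays bounded
print(eag_weight(adv))  # grows without bound for large positive A
```

Either weight is monotonically increasing in the advantage, which is the property the policy-improvement argument relies on; the sigmoid form additionally keeps the guidance signal bounded.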

Why it matters

Existing guides for diffusion world models in MBRL are either policy-only, discarding value information, or reward-based, which becomes myopic when the generated horizon is short. AGD-MBRL addresses this by steering trajectory generation with advantage estimates, so that sampling concentrates on trajectories expected to yield higher long-term return beyond the generated window. This improves sample efficiency and final performance in model-based RL, making diffusion world models more effective.

Original Abstract

Model-based reinforcement learning (MBRL) with autoregressive world models suffers from compounding errors, whereas diffusion world models mitigate this by generating trajectory segments jointly. However, existing diffusion guides are either policy-only, discarding value information, or reward-based, which becomes myopic when the diffusion horizon is short. We introduce Advantage-Guided Diffusion for MBRL (AGD-MBRL), which steers the reverse diffusion process using the agent's advantage estimates so that sampling concentrates on trajectories expected to yield higher long-term return beyond the generated window. We develop two guides: (i) Sigmoid Advantage Guidance (SAG) and (ii) Exponential Advantage Guidance (EAG). We prove that a diffusion model guided through SAG or EAG allows us to perform reweighted sampling of trajectories with weights increasing in state-action advantage, implying policy improvement under standard assumptions. Additionally, we show that the trajectories generated from AGD-MBRL follow an improved policy (that is, with higher value) compared to an unguided diffusion model. AGD integrates seamlessly with PolyGRAD-style architectures by guiding the state components while leaving action generation policy-conditioned, and requires no change to the diffusion training objective. On MuJoCo control tasks (HalfCheetah, Hopper, Walker2D and Reacher), AGD-MBRL improves sample efficiency and final return over PolyGRAD, an online Diffuser-style reward guide, and model-free baselines (PPO/TRPO), in some cases by a margin of 2x. These results show that advantage-aware guidance is a simple, effective remedy for short-horizon myopia in diffusion-model MBRL.
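
To make the mechanism concrete, below is a minimal sketch of one advantage-guided reverse-diffusion step in the PolyGRAD-style setup the abstract describes: a classifier-guidance-style gradient is applied to the state components only, while actions are re-sampled from the current policy. All callable names, the update rule, and the log-sigmoid weight are illustrative assumptions, not the paper's code:

```python
import torch

def advantage_guided_step(denoise_mean, sigma_t, advantage_fn, policy_sample,
                          states_t, actions_t, t, guidance_scale=1.0):
    """One reverse-diffusion step with advantage guidance on the states only.

    Placeholder interfaces (assumed, not from the paper):
      denoise_mean(states, actions, t) -> posterior mean of the states at step t-1
      sigma_t                          -> noise scale at step t (float)
      advantage_fn(states, actions)    -> differentiable advantage estimates A(s, a)
      policy_sample(states)            -> actions drawn from the current policy
    """
    states = states_t.detach().requires_grad_(True)

    # Classifier-guidance-style term: gradient of the log guidance weight with
    # respect to the noisy states. With SAG the weight is sigmoid(A), so the
    # log-weight is log-sigmoid(A); with EAG it would be linear in A.
    log_w = torch.nn.functional.logsigmoid(advantage_fn(states, actions_t)).sum()
    grad = torch.autograd.grad(log_w, states)[0]

    # Shift the denoised mean toward higher-advantage states, then add noise.
    mean = denoise_mean(states, actions_t, t) + guidance_scale * sigma_t**2 * grad
    noise = torch.randn_like(mean) if t > 0 else torch.zeros_like(mean)
    next_states = (mean + sigma_t * noise).detach()

    # Leave action generation policy-conditioned: re-sample actions from the
    # current policy given the guided states, with no guidance applied to them.
    next_actions = policy_sample(next_states)
    return next_states, next_actions
```

Note that the diffusion training objective is untouched: guidance only modifies the sampling-time update, which is what lets AGD drop into an existing PolyGRAD-style model.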
