Recovering Hidden Reward in Diffusion-Based Policies
Yanbiao Ji, Qiuchang Li, Yuting Hu, Shaokai Wu, Wenyuan Xie + 5 more
TLDR
EnergyFlow unifies generative action modeling with inverse RL to extract rewards from diffusion policies without adversarial training.
Key contributions
- Introduces EnergyFlow, unifying generative action modeling and inverse RL via a scalar energy function whose gradient is the denoising field (see the sketch after this list).
- Proves that, under maximum-entropy optimality, denoising score matching recovers the gradient of the expert's soft Q-function, enabling reward extraction without adversarial training.
- Shows that constraining the learned field to be conservative reduces hypothesis complexity and tightens out-of-distribution generalization bounds.
- Achieves state-of-the-art imitation performance and provides an effective reward signal for downstream reinforcement learning.
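To make the first and third contributions concrete, here is a minimal PyTorch sketch, ours rather than the authors' code (`EnergyDenoiser`, `dsm_loss`, and all hyperparameters are hypothetical): a network outputs a scalar energy, autograd supplies its action-gradient as the denoising field, and because that field is the gradient of a scalar potential it is conservative by construction.

```python
import torch
import torch.nn as nn

class EnergyDenoiser(nn.Module):
    """Scalar energy E_theta(s, a, sigma); its action-gradient is the
    denoising field. A gradient of a scalar potential is conservative
    by construction, so the constraint is architectural, not penalized."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),  # scalar energy head
        )

    def energy(self, state, action, sigma):
        return self.net(torch.cat([state, action, sigma], dim=-1)).squeeze(-1)

    def score(self, state, action, sigma):
        # Denoising field = -grad_a E_theta, computed exactly via autograd.
        action = action.detach().requires_grad_(True)
        e = self.energy(state, action, sigma).sum()
        return -torch.autograd.grad(e, action, create_graph=True)[0]


def dsm_loss(model, state, action, sigma):
    """Denoising score matching: regress the model's field onto the score
    of the Gaussian perturbation kernel, -(a_noisy - a)/sigma^2 = -eps/sigma."""
    eps = torch.randn_like(action)
    noisy = action + sigma * eps
    target = -eps / sigma
    pred = model.score(state, noisy, sigma)
    return ((pred - target) ** 2).mean()


# Hypothetical usage with random tensors:
model = EnergyDenoiser(state_dim=10, action_dim=4)
s, a = torch.randn(32, 10), torch.randn(32, 4)
sigma = torch.full((32, 1), 0.5)
dsm_loss(model, s, a, sigma).backward()
```

Parameterizing the field as -∇ₐE makes conservativity a hard constraint rather than a soft penalty, which is one way to read the paper's claim that the structural requirement for valid reward extraction doubles as an inductive bias.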
Why it matters
EnergyFlow offers a novel, efficient way to extract hidden rewards from diffusion policies, simplifying inverse reinforcement learning by avoiding adversarial training. This makes reward recovery more robust and improves policy generalization, and the framework achieves state-of-the-art imitation performance while providing an effective reward signal for downstream RL.
Original Abstract
This paper introduces EnergyFlow, a framework that unifies generative action modeling with inverse reinforcement learning by parameterizing a scalar energy function whose gradient is the denoising field. We establish that under maximum-entropy optimality, the score function learned via denoising score matching recovers the gradient of the expert's soft Q-function, enabling reward extraction without adversarial training. Formally, we prove that constraining the learned field to be conservative reduces hypothesis complexity and tightens out-of-distribution generalization bounds. We further characterize the identifiability of recovered rewards and bound how score estimation errors propagate to action preferences. Empirically, EnergyFlow achieves state-of-the-art imitation performance on various manipulation tasks while providing an effective reward signal for downstream reinforcement learning that outperforms both adversarial IRL methods and likelihood-based alternatives. These results show that the structural constraints required for valid reward extraction simultaneously serve as beneficial inductive biases for policy generalization. The code is available at https://github.com/sotaagi/EnergyFlow.
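For intuition on the central identity in the abstract, here is a short derivation under standard maximum-entropy assumptions (our notation, with temperature α; the paper's exact statement and conditions may differ):

```latex
% The max-ent optimal policy is Boltzmann in the soft Q-function:
\pi^{*}(a \mid s) \;=\; \frac{\exp\!\big(Q_{\mathrm{soft}}(s,a)/\alpha\big)}
                             {\int \exp\!\big(Q_{\mathrm{soft}}(s,a')/\alpha\big)\,\mathrm{d}a'}.
% The normalizer is constant in a, so taking the action-score gives
\nabla_{a} \log \pi^{*}(a \mid s) \;=\; \tfrac{1}{\alpha}\,\nabla_{a} Q_{\mathrm{soft}}(s,a).
% Denoising score matching estimates the left-hand side, so the learned
% field recovers the soft Q-gradient (and hence a reward, up to the usual
% identifiability ambiguities) without any adversarial discriminator.
```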