Recovering Hidden Reward in Diffusion-Based Policies
Yanbiao Ji, Qiuchang Li, Yuting Hu, Shaokai Wu, Wenyuan Xie + 5 more
TLDR
EnergyFlow unifies generative action modeling with inverse RL to extract rewards from diffusion policies without adversarial training.
Key contributions
- Introduces EnergyFlow, unifying generative action modeling and inverse RL via a scalar energy function whose gradient is the denoising field (see the sketch after this list).
- Proves that, under maximum-entropy optimality, denoising score matching recovers the gradient of the expert's soft Q-function, enabling reward extraction without adversarial training.
- Shows that constraining the learned field to be conservative reduces hypothesis complexity and tightens out-of-distribution generalization bounds.
- Achieves state-of-the-art imitation performance and provides an effective reward signal for downstream reinforcement learning.
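To make the first and third contributions concrete, here is a minimal PyTorch sketch, ours rather than the authors' code (`EnergyDenoiser`, `dsm_loss`, and all hyperparameters are hypothetical): a network outputs a scalar energy, autograd supplies its action-gradient as the denoising field, and because that field is the gradient of a scalar potential it is conservative by construction.

```python
import torch
import torch.nn as nn

class EnergyDenoiser(nn.Module):
    """Scalar energy E_theta(s, a, sigma); its action-gradient is the
    denoising field. A gradient of a scalar potential is conservative
    by construction, so the constraint is architectural, not penalized."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),  # scalar energy head
        )

    def energy(self, state, action, sigma):
        return self.net(torch.cat([state, action, sigma], dim=-1)).squeeze(-1)

    def score(self, state, action, sigma):
        # Denoising field = -grad_a E_theta, computed exactly via autograd.
        action = action.detach().requires_grad_(True)
        e = self.energy(state, action, sigma).sum()
        return -torch.autograd.grad(e, action, create_graph=True)[0]


def dsm_loss(model, state, action, sigma):
    """Denoising score matching: regress the model's field onto the score
    of the Gaussian perturbation kernel, -(a_noisy - a)/sigma^2 = -eps/sigma."""
    eps = torch.randn_like(action)
    noisy = action + sigma * eps
    target = -eps / sigma
    pred = model.score(state, noisy, sigma)
    return ((pred - target) ** 2).mean()


# Hypothetical usage with random tensors:
model = EnergyDenoiser(state_dim=10, action_dim=4)
s, a = torch.randn(32, 10), torch.randn(32, 4)
sigma = torch.full((32, 1), 0.5)
dsm_loss(model, s, a, sigma).backward()
```

Parameterizing the field as -∇ₐE makes conservativity a hard constraint rather than a soft penalty, which is one way to read the paper's claim that the structural requirement for valid reward extraction doubles as an inductive bias.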
Why it matters
EnergyFlow offers a novel, efficient way to extract hidden rewards from diffusion policies, simplifying inverse reinforcement learning by avoiding adversarial training. This makes reward recovery more robust and improves policy generalization, and the framework achieves state-of-the-art imitation performance while providing an effective reward signal for downstream RL.
Original Abstract
This paper introduces EnergyFlow, a framework that unifies generative action modeling with inverse reinforcement learning by parameterizing a scalar energy function whose gradient is the denoising field. We establish that under maximum-entropy optimality, the score function learned via denoising score matching recovers the gradient of the expert's soft Q-function, enabling reward extraction without adversarial training. Formally, we prove that constraining the learned field to be conservative reduces hypothesis complexity and tightens out-of-distribution generalization bounds. We further characterize the identifiability of recovered rewards and bound how score estimation errors propagate to action preferences. Empirically, EnergyFlow achieves state-of-the-art imitation performance on various manipulation tasks while providing an effective reward signal for downstream reinforcement learning that outperforms both adversarial IRL methods and likelihood-based alternatives. These results show that the structural constraints required for valid reward extraction simultaneously serve as beneficial inductive biases for policy generalization. The code is available at https://github.com/sotaagi/EnergyFlow.
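For intuition on the central identity in the abstract, here is a short derivation under standard maximum-entropy assumptions (our notation, with temperature α; the paper's exact statement and conditions may differ):

```latex
% The max-ent optimal policy is Boltzmann in the soft Q-function:
\pi^{*}(a \mid s) \;=\; \frac{\exp\!\big(Q_{\mathrm{soft}}(s,a)/\alpha\big)}
                             {\int \exp\!\big(Q_{\mathrm{soft}}(s,a')/\alpha\big)\,\mathrm{d}a'}.
% The normalizer is constant in a, so taking the action-score gives
\nabla_{a} \log \pi^{*}(a \mid s) \;=\; \tfrac{1}{\alpha}\,\nabla_{a} Q_{\mathrm{soft}}(s,a).
% Denoising score matching estimates the left-hand side, so the learned
% field recovers the soft Q-gradient (and hence a reward, up to the usual
% identifiability ambiguities) without any adversarial discriminator.
```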