dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
Yaxuan Li, Zhongyi Zhou, Yefei Chen, Yaokai Xue, Yichen Zhu
TLDR
dWorldEval introduces a discrete diffusion world model for scalable robotic policy evaluation, unifying modalities and outperforming prior methods.
Key contributions
- Utilizes a discrete diffusion world model for scalable robotic policy evaluation.
- Unifies vision, language, and actions into a single token space via a transformer denoising network.
- Introduces a progress token that indicates the degree of task completion and a sparse keyframe memory that maintains spatiotemporal consistency.
- Reports significantly better results than WorldEval, Ctrl-World, and WorldGym on LIBERO, RoboTwin, and multiple real-robot tasks.
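One common way to place several modalities in a single token space is to give each modality its own contiguous range of ids inside one shared vocabulary. The sketch below illustrates that idea only; the summary does not describe the paper's tokenizers or vocabulary layout, so every name and size here (`Vocab`, `build_sequence`, the vocabulary sizes, the 11 progress bins, the mask id) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Vocab:
    """Hypothetical vocabulary layout; all sizes are illustrative assumptions."""
    n_text: int = 1000      # language tokens
    n_vision: int = 4096    # discrete image tokens (e.g. from a VQ-style tokenizer)
    n_action: int = 256     # discretized robot action bins
    n_progress: int = 11    # progress bins for 0.0, 0.1, ..., 1.0

    @property
    def vision_offset(self) -> int:
        return self.n_text

    @property
    def action_offset(self) -> int:
        return self.n_text + self.n_vision

    @property
    def progress_offset(self) -> int:
        return self.n_text + self.n_vision + self.n_action

    @property
    def mask_id(self) -> int:
        # One extra id reserved for the [MASK] token used by discrete diffusion.
        return self.progress_offset + self.n_progress


def build_sequence(vocab: Vocab, text: List[int], vision: List[int],
                   action: List[int], progress_bin: int) -> List[int]:
    """Shift each modality's local ids into its range and concatenate them."""
    if not 0 <= progress_bin < vocab.n_progress:
        raise ValueError("progress_bin out of range")
    return (
        list(text)
        + [vocab.vision_offset + v for v in vision]
        + [vocab.action_offset + a for a in action]
        + [vocab.progress_offset + progress_bin]
    )


vocab = Vocab()
sequence = build_sequence(vocab, text=[3, 7], vision=[0, 5], action=[2], progress_bin=10)
print(sequence)
```

Because every id is unique across modalities, a single transformer can embed and predict all of them with one output head, which is the property the contribution list describes.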
Why it matters
Evaluating robotics policies across thousands of environments and tasks is infeasible with existing approaches. dWorldEval proposes a discrete diffusion world model as a scalable evaluation proxy: it maps vision, language, and actions into one token space and judges task success from a predicted progress token. The authors present it as a step toward a new architectural paradigm for building world simulators for robotics evaluation at scale.
Original Abstract
Evaluating robotics policies across thousands of environments and thousands of tasks is infeasible with existing approaches. This motivates the need for a new methodology for scalable robotics policy evaluation. In this paper, we propose dWorldEval, which uses a discrete diffusion world model as a scalable evaluation proxy for robotics policies. Specifically, dWorldEval maps all modalities, including vision, language, and robotic actions, into a unified token space, modeling them via a single transformer-based denoising network. Building on this architecture, we employ a sparse keyframe memory to maintain spatiotemporal consistency. We also introduce a progress token that indicates the degree of task completion. At inference, the model jointly predicts future observations and the progress token, allowing it to automatically determine success when the progress reaches 1. Extensive experiments demonstrate that dWorldEval significantly outperforms previous approaches, i.e., WorldEval, Ctrl-World, and WorldGym, on LIBERO, RoboTwin, and multiple real-robot tasks. It paves the way for a new architectural paradigm in building world simulators for robotics evaluation at scale.
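The inference procedure in the abstract (roll the policy out inside the world model and declare success once the progress reaches 1) can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `evaluate_policy`, `toy_policy`, `toy_world_model`, the `max_steps` budget, and the list-of-ints observation format are all hypothetical, and the sparse keyframe memory is omitted.

```python
from typing import Callable, List, Tuple

Observation = List[int]   # assumed: an observation is a list of discrete tokens
Action = List[int]        # assumed: an action is a list of discrete tokens

Policy = Callable[[Observation, str], Action]
# Assumed world-model interface: returns the next observation and a progress value in [0, 1].
WorldModel = Callable[[Observation, Action, str], Tuple[Observation, float]]


def evaluate_policy(policy: Policy, world_model: WorldModel,
                    initial_obs: Observation, instruction: str,
                    max_steps: int = 50,
                    success_threshold: float = 1.0) -> Tuple[bool, int]:
    """Roll out the policy in the world model; succeed when progress hits the threshold."""
    obs = initial_obs
    for step in range(1, max_steps + 1):
        action = policy(obs, instruction)
        # The world model jointly predicts the next observation and the progress value.
        obs, progress = world_model(obs, action, instruction)
        if progress >= success_threshold:
            return True, step
    return False, max_steps


def toy_policy(obs: Observation, instruction: str) -> Action:
    """Stand-in policy: always emits the same single-token action."""
    return [1]


def toy_world_model(obs: Observation, action: Action,
                    instruction: str) -> Tuple[Observation, float]:
    """Stand-in world model: a counter whose progress saturates at 1.0 after 4 steps."""
    next_obs = [obs[0] + action[0]]
    return next_obs, min(1.0, next_obs[0] / 4)


success, steps = evaluate_policy(toy_policy, toy_world_model, [0], "pick up the cup")
print(success, steps)
```

Averaging the boolean result over many initial observations and instructions would give a success-rate estimate without running the policy on a physical robot or in a hand-built simulator.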