Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
Peiyan Li, Yixiang Chen, Yuan Xu, Jiabing Yang, Xiangnan Wu, et al.
TLDR
MV-VDP is a multi-view video diffusion policy that jointly models 3D spatio-temporal states for data-efficient, robust robotic manipulation.
Key contributions
- Predicts multi-view heatmap and RGB videos, aligning the representation format of video pretraining with action finetuning.
- Achieves data-efficient manipulation with only ten demonstrations, without additional pretraining.
- Demonstrates strong robustness, generalization to out-of-distribution settings, and realistic future video prediction.
- Outperforms existing video-prediction-based, 3D-based, and vision-language-action models.
Why it matters
Most robotic manipulation policies rely on 2D observations and backbones pretrained on static image-text pairs, so they struggle with 3D spatial and temporal understanding and require large amounts of demonstration data. By jointly modeling the environment's 3D spatio-temporal state, MV-VDP sharply reduces those data requirements, making complex manipulation tasks more practical to train.
Original Abstract
Robotic manipulation requires understanding both the 3D spatial structure of the environment and its temporal evolution, yet most existing policies overlook one or both. They typically rely on 2D visual observations and backbones pretrained on static image-text pairs, resulting in high data requirements and limited understanding of environment dynamics. To address this, we introduce MV-VDP, a multi-view video diffusion policy that jointly models the 3D spatio-temporal state of the environment. The core idea is to simultaneously predict multi-view heatmap videos and RGB videos, which 1) align the representation format of video pretraining with action finetuning, and 2) specify not only what actions the robot should take, but also how the environment is expected to evolve in response to those actions. Extensive experiments show that MV-VDP enables data-efficient, robust, generalizable, and interpretable manipulation. With only ten demonstration trajectories and without additional pretraining, MV-VDP successfully performs complex real-world tasks, demonstrates strong robustness across a range of model hyperparameters, generalizes to out-of-distribution settings, and predicts realistic future videos. Experiments on Meta-World and real-world robotic platforms demonstrate that MV-VDP consistently outperforms video-prediction-based, 3D-based, and vision-language-action models, establishing a new state of the art in data-efficient multi-task manipulation.