ArXiv TLDR

VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

arXiv: 2604.21914

Songen Gu, Yuhang Zheng, Weize Li, Yupeng Zheng, Yating Feng + 4 more

cs.RO

TLDR

VistaBot makes robot manipulation robust to camera viewpoint changes by combining feed-forward geometric models with video diffusion for closed-loop control.

Key contributions

  • Develops VistaBot, a framework for view-robust robot manipulation using geometry-aware view synthesis.
  • Introduces a latent action planner that integrates 4D geometry and video diffusion for control.
  • Proposes the View Generalization Score (VGS) metric for measuring cross-view generalization; VistaBot improves VGS by 2.79x and 2.63x over the ACT and π_0 baselines, respectively (see the sketch after this list).
  • Enables closed-loop manipulation robust to viewpoint changes without test-time camera calibration.
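
This digest does not define how VGS is computed. The sketch below is a minimal, hypothetical reading, assuming VGS aggregates per-viewpoint task success rates with a harmonic mean so that weak viewpoints dominate the score; the function name and the aggregation rule are illustrative assumptions, not the authors' definition.

```python
from typing import Dict

def view_generalization_score(success_by_view: Dict[str, float]) -> float:
    """Hypothetical VGS: harmonic mean of per-viewpoint success rates.

    Assumption: a cross-view metric should be dragged down by the worst
    viewpoint, which a harmonic mean captures. This is an illustrative
    guess, not the paper's definition of VGS.
    """
    rates = list(success_by_view.values())
    if not rates or min(rates) <= 0.0:
        return 0.0
    return len(rates) / sum(1.0 / r for r in rates)

# Example: strong from the training view, weak from shifted views.
scores = {"train_view": 0.90, "shift_15deg": 0.40, "shift_30deg": 0.25}
print(f"VGS = {view_generalization_score(scores):.3f}")  # ≈ 0.394
```

Under this reading, a policy that only works from its training camera scores poorly even if its best-view success rate is high, which matches the metric's stated goal of rewarding cross-view generalization.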

Why it matters

This paper addresses a critical limitation in end-to-end robotic manipulation: poor robustness to camera viewpoint changes. By integrating geometric and diffusion models, VistaBot enables robots to perform tasks reliably from various perspectives. This advancement is crucial for deploying generalizable robots in real-world, dynamic environments.

Original Abstract

Recently, end-to-end robotic manipulation models have gained significant attention for their generalizability and scalability. However, they often suffer from limited robustness to camera viewpoint changes when trained with a fixed camera. In this paper, we propose VistaBot, a novel framework that integrates feed-forward geometric models with video diffusion models to achieve view-robust closed-loop manipulation without requiring camera calibration at test time. Our approach consists of three key components: 4D geometry estimation, view synthesis latent extraction, and latent action learning. VistaBot is integrated into both action-chunking (ACT) and diffusion-based ($π_0$) policies and evaluated across simulation and real-world tasks. We further introduce the View Generalization Score (VGS) as a new metric for comprehensive evaluation of cross-view generalization. Results show that VistaBot improves VGS by 2.79$\times$ and 2.63$\times$ over ACT and $π_0$, respectively, while also achieving high-quality novel view synthesis. Our contributions include a geometry-aware synthesis model, a latent action planner, a new benchmark metric, and extensive validation across diverse environments. The code and models will be made publicly available.
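
To make the three-component design concrete, here is a minimal closed-loop sketch following the decomposition stated in the abstract (4D geometry estimation → view synthesis latent extraction → latent action learning). All class and method names (`GeometryEstimator`, `ViewSynthesizer`, `LatentActionPlanner`, `control_step`) are hypothetical placeholders, not a released API; the code only illustrates the assumed data flow.

```python
import numpy as np

class GeometryEstimator:
    """Hypothetical stand-in for the feed-forward 4D geometry model."""
    def estimate(self, frames: np.ndarray) -> np.ndarray:
        ...  # assumed to return a 4D (3D + time) scene representation

class ViewSynthesizer:
    """Hypothetical stand-in for the video-diffusion view synthesis model."""
    def extract_latent(self, geometry: np.ndarray, frames: np.ndarray) -> np.ndarray:
        ...  # assumed to synthesize a canonical view and return its latent,
             # removing dependence on the (uncalibrated) test-time camera

class LatentActionPlanner:
    """Hypothetical stand-in for the latent action policy (ACT or π_0 head)."""
    def plan(self, latent: np.ndarray) -> np.ndarray:
        ...  # assumed to map view-invariant latents to an action chunk

def control_step(frames, geom, synth, planner):
    """One closed-loop step under the assumed VistaBot decomposition."""
    geometry = geom.estimate(frames)                 # 1) 4D geometry estimation
    latent = synth.extract_latent(geometry, frames)  # 2) view synthesis latent extraction
    return planner.plan(latent)                      # 3) latent action planning
```

Note the property this structure would buy: the stage-3 policy only ever sees the synthesized-view latent, which is consistent with the abstract's claim that no camera calibration is needed at test time.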
