Mask World Model: Predicting What Matters for Robust Robot Policy Learning

April 21, 2026 | arXiv:2604.19683

Yunfan Lou, Xiaowei Chi, Xiaojie Zhang, Zezhong Qian, Chengxuan Li + 7 more

cs.RO

TLDR

Mask World Model predicts semantic masks instead of pixels for robust robot policy learning, outperforming RGB-based world models.

Key contributions

Introduces Mask World Model (MWM) predicting semantic masks to filter visual noise in robot policy learning.
Leverages video diffusion architectures to create a geometric information bottleneck, focusing on essential physical dynamics.
Integrates mask dynamics with a diffusion-based policy head for robust end-to-end robot control.
Outperforms state-of-the-art RGB-based world models on the LIBERO and RLBench simulation benchmarks, with real-world experiments showing stronger generalization and robustness.

Why it matters

Robot world models trained for high-fidelity RGB prediction can overfit to visual noise such as dynamic backgrounds and illumination changes, which leads to fragile policies. MWM addresses this by predicting semantic masks instead of pixels, which the authors report improves robustness and generalization, both important for building reliable and adaptable generalist robot systems.

Original Abstract

World models derived from large-scale video generative pre-training have emerged as a promising paradigm for generalist robot policy learning. However, standard approaches often focus on high-fidelity RGB video prediction, which can result in overfitting to irrelevant factors, such as dynamic backgrounds and illumination changes. These distractions reduce the model's ability to generalize, ultimately leading to unreliable and fragile control policies. To address this, we introduce the Mask World Model (MWM), which leverages video diffusion architectures to predict the evolution of semantic masks instead of pixels. This shift imposes a geometric information bottleneck, forcing the model to capture essential physical dynamics and contact relations while filtering out visual noise. We seamlessly integrate this mask dynamics backbone with a diffusion-based policy head to enable robust end-to-end control. Extensive evaluations demonstrate the superiority of MWM on the LIBERO and RLBench simulation benchmarks, significantly outperforming state-of-the-art RGB-based world models. Furthermore, real-world experiments and robustness evaluation (via random token pruning) reveal that MWM exhibits superior generalization capabilities and resilience to texture information loss.
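
The abstract mentions a robustness evaluation via random token pruning but does not describe the procedure. As a rough illustration of the general idea only, the sketch below drops a random subset of visual tokens before they would be passed to a model. Everything here is an assumption made for illustration: the function name `prune_tokens`, the `keep_ratio` parameter, and the representation of tokens as a list of feature vectors are hypothetical and not taken from the paper.

```python
import random
from typing import List, Sequence, Tuple


def prune_tokens(
    tokens: Sequence[Sequence[float]],
    keep_ratio: float,
    seed: int = 0,
) -> Tuple[List[Sequence[float]], List[int]]:
    """Randomly keep a fraction of tokens (hypothetical helper, not the paper's code).

    Returns the surviving tokens in their original order, plus their indices,
    so that positional information can still be recovered downstream.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if len(tokens) == 0:
        return [], []

    # Keep at least one token so the downstream model always has some input.
    n_keep = max(1, round(len(tokens) * keep_ratio))

    # A seeded generator makes the pruning pattern reproducible across runs.
    rng = random.Random(seed)
    kept_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_idx], kept_idx


if __name__ == "__main__":
    # 16 dummy tokens with a single feature each, standing in for patch embeddings.
    dummy_tokens = [[float(i)] for i in range(16)]
    kept, idx = prune_tokens(dummy_tokens, keep_ratio=0.5, seed=0)
    print(len(kept), idx)
```

In an evaluation of this kind, one would typically sweep `keep_ratio` downward and track task success rate at each level; how the paper actually sets this up is not stated in the abstract.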
