arXiv TLDR

MotuBrain: An Advanced World Action Model for Robot Control

2604.27792

MotuBrain Team, Chendong Xiang, Fan Bao, Haitian Liu, Hengkai Tan + 15 more

cs.RO

TLDR

MotuBrain is a unified multimodal generative model for robot control that jointly models video and action under a UniDiffuser formulation with a three-stream Mixture-of-Transformers architecture.

Key contributions

  • Unified multimodal generative model of video and action under a UniDiffuser formulation.
  • A single model supports multiple inference modes: policy learning, world modeling, video generation, inverse dynamics, and joint video-action prediction.
  • Scales to heterogeneous data, including video-only and cross-embodiment robot data.
  • Achieves over 50x inference speedup, enabling real-time robot control deployment.

Why it matters

VLA models often lack fine-grained modeling of world dynamics. MotuBrain addresses this by unifying video and action modeling in a single generative model, improving robot control. Its inference efficiency and diverse capabilities make it well suited to real-world robotic systems.

Original Abstract

Vision-Language-Action (VLA) models achieve strong semantic generalization but often lack fine-grained modeling of world dynamics. Recent work explores video generation models as a foundation for world modeling, leading to unified World Action Models (WAMs) that jointly model visual dynamics and actions. We present MotuBrain, a unified multimodal generative model that jointly models video and action under a UniDiffuser formulation with a three-stream Mixture-of-Transformers architecture. A single model supports multiple inference modes, including policy learning, world modeling, video generation, inverse dynamics, and joint video-action prediction, while scaling to heterogeneous multimodal data such as video-only and cross-embodiment robot data. To improve real-world applicability, MotuBrain introduces a unified multiview representation, explicit language-action coupling, and an efficient inference stack, achieving over 50x speedup for real-time deployment.
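The abstract's "multiple inference modes from a single model" follows the UniDiffuser recipe, in which each modality carries its own diffusion timestep: timestep 0 marks a modality as a clean conditioning input, while the maximum timestep marks it as pure noise to be generated (or marginalized out). Below is a minimal sketch of that mode-selection logic; the function, mode names, and `T_MAX` value are illustrative assumptions, not taken from the MotuBrain codebase.

```python
T_MAX = 1000  # assumed number of diffusion steps

def select_timesteps(mode: str, t: int) -> dict:
    """Return per-modality timesteps (video, action) for one denoising step t.

    Timestep 0 = modality is given as a clean condition;
    T_MAX      = modality is held at pure noise (effectively ignored);
    t          = modality is being denoised / generated.
    """
    if mode == "policy":       # condition on video, generate action
        return {"video": 0, "action": t}
    if mode == "world_model":  # condition on action, generate video
        return {"video": t, "action": 0}
    if mode == "video_gen":    # video generation, action marginalized out
        return {"video": t, "action": T_MAX}
    if mode == "joint":        # joint video-action prediction
        return {"video": t, "action": t}
    raise ValueError(f"unknown mode: {mode}")
```

One denoiser trained over all combinations of per-modality noise levels can then serve every mode at inference time simply by choosing which timestep schedule to feed it.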
