From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
Yajie Li, Bozhou Zhang, Chun Gu, Zipei Ma, Jiahui Zhang, et al.
TLDR
MoLA transforms imagined robot manipulation videos into executable actions by inferring a mixture of latent actions via inverse dynamics models.
Key contributions
- Introduces MoLA, a control-oriented interface for robot manipulation using imagined future videos.
- Infers a mixture of latent actions from generated visual transitions using pretrained inverse dynamics models (see the sketch after this list).
- Leverages complementary semantic, depth, and flow cues for physically grounded action representations.
- Achieves consistent gains in task success, temporal consistency, and generalization on simulated benchmarks (LIBERO, CALVIN, LIBERO-Plus) and real-world manipulation tasks.
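
The contributions above describe the interface at a high level; the sketch below makes the latent-action step concrete. It is a minimal PyTorch illustration that assumes feature extractors already exist for each modality. The `InverseDynamicsModel` and `MixtureOfLatentActions` classes, all dimensions, and the softmax-gated fusion are our own assumptions for exposition, not the paper's released code.

```python
# Minimal sketch of MoLA's latent-action inference step, as described in the
# abstract. Module names, dimensions, and the gated fusion are illustrative
# assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class InverseDynamicsModel(nn.Module):
    """Infers a latent action from a pair of consecutive frame features
    for one modality (e.g. semantic, depth, or optical-flow features)."""

    def __init__(self, feat_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, feat_t: torch.Tensor, feat_next: torch.Tensor) -> torch.Tensor:
        # Concatenate the transition (o_t, o_{t+1}) features and map them
        # to a latent action that explains the change between frames.
        return self.net(torch.cat([feat_t, feat_next], dim=-1))


class MixtureOfLatentActions(nn.Module):
    """Fuses per-modality latent actions with learned mixture weights.
    A softmax gate over modalities is an assumption; the paper may fuse
    its modality-aware inverse dynamics models differently."""

    def __init__(self, modalities, feat_dim: int = 512, latent_dim: int = 64):
        super().__init__()
        self.idms = nn.ModuleDict(
            {m: InverseDynamicsModel(feat_dim, latent_dim) for m in modalities}
        )
        self.gate = nn.Linear(len(modalities) * latent_dim, len(modalities))

    def forward(self, feats_t: dict, feats_next: dict) -> torch.Tensor:
        latents = [self.idms[m](feats_t[m], feats_next[m]) for m in self.idms]
        stacked = torch.stack(latents, dim=1)            # (B, M, latent_dim)
        weights = self.gate(torch.cat(latents, dim=-1))  # (B, M)
        weights = torch.softmax(weights, dim=-1).unsqueeze(-1)
        return (weights * stacked).sum(dim=1)            # (B, latent_dim)


if __name__ == "__main__":
    modalities = ["semantic", "depth", "flow"]
    mola = MixtureOfLatentActions(modalities)
    feats_t = {m: torch.randn(4, 512) for m in modalities}
    feats_next = {m: torch.randn(4, 512) for m in modalities}
    z = mola(feats_t, feats_next)
    print(z.shape)  # torch.Size([4, 64])
```

One design point worth noting: each modality gets its own inverse dynamics model, so depth and flow cues can contribute physical grounding even when they disagree with semantic cues, with the learned gate arbitrating.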
Why it matters
Robot manipulation systems often struggle to translate imagined visual futures into concrete actions: generated frames are optimized for visual realism rather than control relevance. MoLA addresses this by providing a structured, physically grounded action representation that bridges video generation and policy execution, significantly improving control stability and task success.
Original Abstract
Video generation models offer a promising imagination mechanism for robot manipulation by predicting long-horizon future observations, but effectively exploiting these imagined futures for action execution remains challenging. Existing approaches either condition policies on predicted frames or directly decode generated videos into actions, both suffering from a mismatch between visual realism and control relevance. As a result, predicted observations emphasize perceptual fidelity rather than action-centric causes of state transitions, leading to indirect and unstable control. To address this gap, we propose MoLA (Mixture of Latent Actions), a control-oriented interface that transforms imagined future videos into executable representations. Instead of passing predicted frames directly to the policy, MoLA leverages a mixture of pretrained inverse dynamics models to infer a mixture of latent actions implied by generated visual transitions. These modality-aware inverse dynamics models capture complementary semantic, depth, and flow cues, providing a structured and physically grounded action representation that bridges video imagination and policy execution. We evaluate our approach on simulated benchmarks (LIBERO, CALVIN, and LIBERO-Plus) and real-world robot manipulation tasks, achieving consistent gains in task success, temporal consistency, and generalization.
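
On the execution side, the abstract notes that predicted frames are not passed to the policy directly; instead the policy consumes the inferred latent actions. Below is a hedged sketch of what such conditioning could look like, continuing the assumptions above; the `LatentConditionedPolicy` name, the 7-DoF action head, and all sizes are illustrative choices, not the paper's architecture.

```python
# Hedged sketch of the execution side: a policy conditioned on the current
# observation features plus the mixture-of-latent-actions summary of the
# imagined transition, rather than on raw predicted frames.
import torch
import torch.nn as nn


class LatentConditionedPolicy(nn.Module):
    """Maps (observation features, latent action) to a low-level robot
    command, e.g. a 7-DoF end-effector action. Illustrative only."""

    def __init__(self, obs_dim: int = 512, latent_dim: int = 64, action_dim: int = 7):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, obs_feat: torch.Tensor, latent_action: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([obs_feat, latent_action], dim=-1))


# Usage: z stands in for the latent action inferred by the mixture of
# inverse dynamics models in the earlier sketch.
policy = LatentConditionedPolicy()
obs_feat = torch.randn(4, 512)
z = torch.randn(4, 64)
action = policy(obs_feat, z)
print(action.shape)  # torch.Size([4, 7])
```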