ArXiv TLDR

EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

arXiv: 2605.06192

Zhaoyang Yang, Yurun Jin, Lizhe Qi, Cong Huang, Kai Chen

cs.CV, cs.AI, cs.RO

TLDR

EA-WM is a generative world model that uses structured kinematic-to-visual action fields to improve robot interaction dynamics and geometry in generated videos.

Key contributions

  • Addresses the problem of poor robot geometry and interaction dynamics in generated rollouts from existing world models.
  • Introduces Structured Kinematic-to-Visual Action Fields, which project actions and kinematic states directly into the target camera view rather than injecting them as abstract, low-dimensional tokens.
  • Employs event-aware bidirectional fusion blocks to capture object state changes and interaction dynamics.
  • Achieves state-of-the-art performance on the WorldArena benchmark, significantly outperforming baselines.
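The core geometric idea behind the action fields, projecting kinematic states into the target camera view, can be sketched with a standard pinhole camera model. This is an illustrative approximation only: the function name, arguments, and field construction here are assumptions, and the paper's actual rendering of structured action fields is not specified in this summary.

```python
import numpy as np

def project_kinematics_to_view(points_3d, K, extrinsic):
    """Project 3D kinematic points (e.g., joint or end-effector positions)
    into a camera view as 2D pixel coordinates.

    points_3d : (N, 3) array of points in the world frame
    K         : (3, 3) camera intrinsic matrix
    extrinsic : (3, 4) world-to-camera matrix [R | t]
    """
    n = points_3d.shape[0]
    # Homogeneous world coordinates, shape (N, 4)
    pts_h = np.hstack([points_3d, np.ones((n, 1))])
    # Transform into the camera frame, shape (N, 3)
    pts_cam = (extrinsic @ pts_h.T).T
    # Pinhole projection: apply intrinsics, then divide by depth
    uv_h = (K @ pts_cam.T).T
    return uv_h[:, :2] / uv_h[:, 2:3]

# Example: a point 2 m in front of an identity-pose camera
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
extrinsic = np.hstack([np.eye(3), np.zeros((3, 1))])
uv = project_kinematics_to_view(np.array([[0.0, 0.0, 2.0]]), K, extrinsic)
# The point lands at the principal point (320, 240)
```

In the paper's framing, such projected kinematic quantities would then be encoded as spatial fields aligned with the video frames, giving the diffusion model a geometrically grounded conditioning signal instead of abstract action tokens.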

Why it matters

This paper introduces a novel approach to robotic world modeling, directly integrating kinematic actions into visual generation. It significantly improves the realism of robot-object interactions and spatial geometry in generated videos. This advancement is crucial for developing more capable and reliable robot learning systems.

Original Abstract

Pretrained video diffusion models provide powerful spatiotemporal generative priors, making them a natural foundation for robotic world models. While recent world-action models jointly optimize future videos and actions, they predominantly treat video generation as an auxiliary representation for policy learning. Consequently, they insufficiently explore the inverse problem: leveraging action signals to guide video synthesis, thereby often failing to preserve precise robot spatial geometry and fine-grained robot-object interaction dynamics in the generated rollouts. To bridge this gap, we present EA-WM, an Event-Aware Generative World Model that effectively closes the loop between kinematic control and visual perception. Rather than injecting joint or end-effector actions as abstract, low-dimensional tokens, EA-WM projects actions and kinematic states directly into the target camera view as Structured Kinematic-to-Visual Action Fields. To fully exploit this geometrically grounded representation, we introduce event-aware bidirectional fusion blocks that modulate cross-branch attention, capturing object state changes and interaction dynamics. Evaluated on the comprehensive WorldArena benchmark, EA-WM achieves state-of-the-art performance, outperforming existing baselines by a significant margin.
