ActionParty: Multi-Subject Action Binding in Generative Video Games
Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov, et al.
TLDR
ActionParty is a new video world model that enables multi-subject action control in generative video games by disentangling subject states.
Key contributions
- Proposes ActionParty, a multi-subject world model for generative video games.
- Introduces subject state tokens and a spatial biasing mechanism for action binding.
- Disentangles global video rendering from individual action-controlled subject updates.
- Controls up to seven players simultaneously across 46 environments with improved accuracy and consistency.
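The core binding mechanism described above, per-subject state tokens attending over video latents with a spatial bias toward each subject's region, can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: the function name `spatially_biased_attention`, the additive-bias formulation, and the mask shapes are all hypothetical placeholders for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatially_biased_attention(state_tokens, video_latents, subject_masks, bias=4.0):
    """Toy sketch: each subject's state token attends over flattened video
    latent tokens, with a positive logit bias added inside that subject's
    spatial mask so actions bind to the right subject.

    state_tokens:  (S, d) one latent state vector per subject
    video_latents: (T, d) flattened spatial video tokens
    subject_masks: (S, T) 1.0 where a video token belongs to subject s
    """
    d = state_tokens.shape[-1]
    logits = state_tokens @ video_latents.T / np.sqrt(d)  # (S, T) similarity
    logits = logits + bias * subject_masks                # spatial biasing step
    attn = softmax(logits, axis=-1)
    return attn @ video_latents                           # per-subject state update

# Toy example: 2 subjects, 6 spatial tokens, 4-dim features.
rng = np.random.default_rng(0)
states = rng.normal(size=(2, 4))
latents = rng.normal(size=(6, 4))
masks = np.array([[1, 1, 0, 0, 0, 0],
                  [0, 0, 0, 1, 1, 0]], dtype=float)
updated = spatially_biased_attention(states, latents, masks)
print(updated.shape)
```

In this sketch the bias simply steers each state token's read toward its own subject's tokens, which is one plausible way to realize the "disentangled" update the contributions describe: the global frame latents are shared, while each subject's state is refreshed mostly from its own region.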
Why it matters
This paper addresses a key limitation of current video diffusion models by enabling multi-agent control, which is crucial for complex interactive environments. It paves the way for more sophisticated and realistic generative video games and simulations with multiple interacting entities.
Original Abstract
Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e. latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.