ActionParty: Multi-Subject Action Binding in Generative Video Games
Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov, et al.
TLDR
ActionParty is a new video world model that enables multi-subject action control in generative video games by disentangling subject states.
Key contributions
- Proposes ActionParty, a multi-subject world model for generative video games.
- Introduces subject state tokens and a spatial biasing mechanism for action binding.
- Disentangles global video rendering from individual action-controlled subject updates.
- Controls up to seven players simultaneously across 46 environments with improved accuracy and consistency.
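The core binding mechanism described above, per-subject state tokens attending over video latents with a spatial bias toward each subject's region, can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: the function name `spatially_biased_attention`, the additive-bias formulation, and the mask shapes are all hypothetical placeholders for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatially_biased_attention(state_tokens, video_latents, subject_masks, bias=4.0):
    """Toy sketch: each subject's state token attends over flattened video
    latent tokens, with a positive logit bias added inside that subject's
    spatial mask so actions bind to the right subject.

    state_tokens:  (S, d) one latent state vector per subject
    video_latents: (T, d) flattened spatial video tokens
    subject_masks: (S, T) 1.0 where a video token belongs to subject s
    """
    d = state_tokens.shape[-1]
    logits = state_tokens @ video_latents.T / np.sqrt(d)  # (S, T) similarity
    logits = logits + bias * subject_masks                # spatial biasing step
    attn = softmax(logits, axis=-1)
    return attn @ video_latents                           # per-subject state update

# Toy example: 2 subjects, 6 spatial tokens, 4-dim features.
rng = np.random.default_rng(0)
states = rng.normal(size=(2, 4))
latents = rng.normal(size=(6, 4))
masks = np.array([[1, 1, 0, 0, 0, 0],
                  [0, 0, 0, 1, 1, 0]], dtype=float)
updated = spatially_biased_attention(states, latents, masks)
print(updated.shape)
```

In this sketch the bias simply steers each state token's read toward its own subject's tokens, which is one plausible way to realize the "disentangled" update the contributions describe: the global frame latents are shared, while each subject's state is refreshed mostly from its own region.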
Why it matters
This paper addresses a key limitation of current video diffusion models by enabling multi-agent control, which is crucial for complex interactive environments. It paves the way for more sophisticated and realistic generative video games and simulations with multiple interacting entities.
Original Abstract
Recent advances in video diffusion have enabled the development of "world models" capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e. latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.