CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
Yihao Meng, Zichen Liu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng + 9 more
TLDR
CausalCine is a real-time autoregressive framework for generating multi-shot video narratives, enabling interactive, coherent storytelling across shot changes.
Key contributions
- CausalCine enables interactive, real-time autoregressive generation of multi-shot video narratives.
- Trains a causal base model on native multi-shot sequences so it learns complex shot transitions before any acceleration.
- Introduces Content-Aware Memory Routing (CAMR), which retrieves historical context by attention-based relevance rather than temporal proximity, preserving cross-shot coherence (see the sketch after this list).
- Distills the base model into a few-step generator for real-time interactive video synthesis.
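The digest gives no implementation details for CAMR, but the routing idea stated in the abstract can be sketched directly: score cached key/value entries against the current query and keep only the top-scoring ones, so the active memory stays bounded regardless of rollout length. A minimal PyTorch sketch, where the function name, shapes, and the use of a single pooled query vector are all illustrative assumptions, not the authors' implementation:

```python
import torch

def camr_retrieve(query, cached_keys, cached_values, budget):
    """Hypothetical CAMR-style routing: pick up to `budget` historical
    KV entries by attention relevance instead of recency.

    query:         (d,)    query for the current generation step
    cached_keys:   (n, d)  keys cached from previously generated shots
    cached_values: (n, d)  matching values
    """
    # Attention-style relevance: scaled dot product of each cached key
    # with the current query, independent of temporal position.
    scores = cached_keys @ query / cached_keys.shape[-1] ** 0.5  # (n,)
    # Route: keep only the `budget` most relevant entries, so active
    # memory stays bounded however long the rollout grows.
    top = torch.topk(scores, k=min(budget, scores.shape[0])).indices
    return cached_keys[top], cached_values[top]
```

In a full attention layer, the routed entries would stand in for a recency-based sliding window before the usual softmax attention over the selected keys and values.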
Why it matters
Existing autoregressive models struggle with multi-shot cinematic storytelling, suffering motion stagnation and semantic drift over long rollouts. CausalCine addresses this by enabling coherent, interactive, real-time generation across evolving events and viewpoint shifts, significantly outperforming autoregressive baselines and unlocking streaming interactivity for complex video narratives.
Original Abstract
Autoregressive video generation aims at real-time, open-ended synthesis. Yet, cinematic storytelling is not merely the endless extension of a single scene; it requires progressing through evolving events, viewpoint shifts, and discrete shot boundaries. Existing autoregressive models often struggle in this setting. Trained primarily for short-horizon continuation, they treat long sequences as extended single shots, inevitably suffering from motion stagnation and semantic drift during long rollouts. To bridge this gap, we introduce CausalCine, an interactive autoregressive framework that transforms multi-shot video generation into an online directing process. CausalCine generates causally across shot changes, accepts dynamic prompts on the fly, and reuses context without regenerating previous shots. To achieve this, we first train a causal base model on native multi-shot sequences to learn complex shot transitions prior to acceleration. We then propose Content-Aware Memory Routing (CAMR), which dynamically retrieves historical KV entries according to attention-based relevance scores rather than temporal proximity, preserving cross-shot coherence under bounded active memory. Finally, we distill the causal base model into a few-step generator for real-time interactive generation. Extensive experiments demonstrate that CausalCine significantly outperforms autoregressive baselines and approaches the capability of bidirectional models while unlocking the streaming interactivity of causal generation. Demo available at https://yihao-meng.github.io/CausalCine/
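Putting the abstract's three pieces together, the interactive rollout it describes (causal generation across shot boundaries, prompts swapped on the fly, context reused rather than regenerated) might look like the loop below. This reuses the hypothetical camr_retrieve from the sketch above; few_step_generate is a stub standing in for the distilled few-step generator, and every name and size here is assumed for illustration:

```python
import torch

def few_step_generate(prompt_emb, mem_k, mem_v):
    # Stand-in for the distilled few-step generator: a real model would
    # denoise the next frame latent in a handful of steps, conditioned on
    # the prompt and the routed memory. Here we return a random latent.
    return torch.randn(prompt_emb.shape[-1])

def stream_shots(prompt_embs, frames_per_shot=16, budget=256):
    dim = prompt_embs[0].shape[-1]
    keys = torch.empty(0, dim)     # one KV cache shared across all shots
    values = torch.empty(0, dim)
    video = []
    for prompt_emb in prompt_embs:           # one prompt per shot, supplied on the fly
        for _ in range(frames_per_shot):
            if keys.shape[0] > 0:            # CAMR routing: relevance, not recency
                mem_k, mem_v = camr_retrieve(prompt_emb, keys, values, budget)
            else:
                mem_k, mem_v = keys, values
            frame = few_step_generate(prompt_emb, mem_k, mem_v)
            keys = torch.cat([keys, frame[None]])      # cache new context; earlier
            values = torch.cat([values, frame[None]])  # shots are never regenerated
            video.append(frame)
    return torch.stack(video)

clip = stream_shots([torch.randn(64) for _ in range(3)])  # a toy 3-shot rollout
```

The point of the structure, under these assumptions, is that shot changes are just new prompts entering the same causal loop: memory routing supplies cross-shot context while the bounded cache and few-step sampling keep generation real time.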