ArXiv TLDR

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

arXiv:2604.28185

Keming Wu, Zuhao Yang, Kaichen Zhang, Shizun Wang, Haowei Zhu + 22 more

cs.CV

TLDR

This paper proposes a five-level taxonomy for visual generation and argues that the field should move from appearance synthesis toward intelligent, agentic, world-aware generation.

Key contributions

  • Introduces a five-level taxonomy for visual generation, from passive renderers to agentic world-modeling generators.
  • Argues for shifting from appearance synthesis to intelligent visual generation grounded in structure and causality.
  • Analyzes key technical drivers like flow matching, unified models, and improved visual representations.
  • Critiques current evaluations for emphasizing perceptual quality while missing structural, temporal, and causal failures.

Why it matters

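This paper offers a capability-centered roadmap for visual generation. It reframes the field's goal from photorealism alone to visuals grounded in structure, dynamics, domain knowledge, and causal relations. It also argues that current evaluations overestimate progress, because models that score well on perceptual quality still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding.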

Original Abstract

Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visuals grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy: Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation, progressing from passive renderers to interactive, agentic, world-aware generators. We analyze key technical drivers, including flow matching, unified understanding-and-generation models, improved visual representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration. We further show that current evaluations often overestimate progress by emphasizing perceptual quality while missing structural, temporal, and causal failures. By combining benchmark review, in-the-wild stress tests, and expert-constrained case studies, this roadmap offers a capability-centered lens for understanding, evaluating, and advancing the next generation of intelligent visual generation systems.
