ArXiv TLDR

LottieGPT: Tokenizing Vector Animation for Autoregressive Generation

arXiv: 2604.11792

Junhao Chen, Kejun Gao, Yuehan Cui, Mingze Sun, Mingjin Chen + 6 more

cs.CV

TLDR

LottieGPT is the first framework to tokenize and autoregressively generate editable vector animations from language or visual prompts, built on a new Lottie tokenizer and a 660K-animation dataset.

Key contributions

  • Presents the first framework for tokenizing and autoregressively generating vector animations.
  • Introduces a Lottie Tokenizer that encodes layered geometric primitives, transforms, and keyframe-based motion into compact, semantically aligned token sequences.
  • Curates LottieAnimation-660K, the largest and most diverse vector animation dataset to date: 660K real-world Lottie animations plus 15M static Lottie images.
  • Fine-tunes Qwen-VL into LottieGPT, a native multimodal model that generates coherent, editable vector animations from natural-language or visual prompts.

Why it matters

Vector animations are crucial for web and app design thanks to their resolution independence and editability, yet existing generative models, which operate in raster space, cannot produce them. LottieGPT fills this gap by generating these complex, structured assets directly, opening new possibilities for creative tools and automated content pipelines.

Original Abstract

Despite rapid progress in video generation, existing models are incapable of producing vector animation, a dominant and highly expressive form of multimedia on the Internet. Vector animations offer resolution-independence, compactness, semantic structure, and editable parametric motion representations, yet current generative models operate exclusively in raster space and thus cannot synthesize them. Meanwhile, recent advances in large multimodal models demonstrate strong capabilities in generating structured data such as slides, 3D meshes, LEGO sequences, and indoor layouts, suggesting that native vector animation generation may be achievable. In this work, we present the first framework for tokenizing and autoregressively generating vector animations. We adopt Lottie, a widely deployed JSON-based animation standard, and design a tailored Lottie Tokenizer that encodes layered geometric primitives, transforms, and keyframe-based motion into a compact and semantically aligned token sequence. To support large-scale training, we also construct LottieAnimation-660K, the largest and most diverse vector animation dataset to date, consisting of 660K real-world Lottie animations and 15M static Lottie image files curated from broad Internet sources. Building upon these components, we finetune Qwen-VL to create LottieGPT, a native multimodal model capable of generating coherent, editable vector animations directly from natural language or visual prompts. Experiments show that our tokenizer dramatically reduces sequence length while preserving structural fidelity, enabling effective autoregressive learning of dynamic vector content. LottieGPT exhibits strong generalization across diverse animation styles and outperforms previous state-of-the-art models on SVG generation (a special case of single-frame vector animation).
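The tokenizer's core idea, flattening Lottie's nested JSON (layers, shapes, transforms, keyframes) into a discrete token sequence, can be sketched roughly as follows. This is a toy illustration only: the field names (`fr`, `layers`, `ty`, `ks`, `p`, `a`, `k`, `t`, `s`) come from the public Lottie schema, but the special tokens and traversal order are my assumptions, since the paper's actual tokenizer design is not reproduced here.

```python
def tokenize_lottie(animation: dict) -> list:
    """Flatten a minimal Lottie-style JSON animation into a token sequence:
    one <layer> block per layer, with shape types, then the layer's
    (possibly keyframed) position emitted as discrete tokens.
    Token vocabulary is hypothetical, not the paper's."""
    tokens = ["<anim>", f"fr={animation['fr']}"]  # frame-rate header token
    for layer in animation.get("layers", []):
        tokens.append("<layer>")
        for shape in layer.get("shapes", []):
            tokens.append(f"ty={shape['ty']}")  # shape type, e.g. 'el' = ellipse
        pos = layer["ks"]["p"]  # 'ks' = transform group, 'p' = position property
        if pos.get("a") == 1:   # animated property: 'k' holds a keyframe list
            for kf in pos["k"]:
                tokens.append(f"t={kf['t']}")              # keyframe time
                tokens.extend(f"v={v}" for v in kf["s"])   # keyframe start value
        else:                   # static property: 'k' holds the value directly
            tokens.extend(f"v={v}" for v in pos["k"])
        tokens.append("</layer>")
    tokens.append("</anim>")
    return tokens

# A one-layer animation: an ellipse whose position animates across two keyframes.
anim = {
    "fr": 30,
    "layers": [{
        "shapes": [{"ty": "el"}],
        "ks": {"p": {"a": 1, "k": [
            {"t": 0,  "s": [0, 0]},
            {"t": 30, "s": [100, 50]},
        ]}},
    }],
}
print(tokenize_lottie(anim))
```

Even this toy version shows where the compression the paper reports comes from: raw Lottie JSON spends most of its bytes on punctuation and repeated key names, while a schema-aware traversal keeps only the values and a small structural vocabulary.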
