LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
Inclusion AI, Tiwei Bie, Haoxing Chen, Tieyuan Chen, Zhenglin Cheng + 13 more
TLDR
LLaDA2.0-Uni is a unified discrete diffusion LLM that matches specialized VLMs in multimodal understanding while delivering strong image generation and editing, pointing to a scalable paradigm for unified foundation models.
Key contributions
- Unified architecture with discrete tokenizer, MoE-dLLM backbone, and diffusion decoder.
- Uses SigLIP-VQ to discretize visual inputs, enabling block-level masked diffusion for both text and vision (see the decoding sketch after this list).
- Achieves high efficiency via prefix-aware optimizations and few-step decoder distillation.
- Matches specialized VLMs in understanding and performs strongly in image generation/editing.
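To make the block-level masked-diffusion decoding concrete, here is a minimal PyTorch sketch. Everything specific is an assumption for illustration: `model`, `MASK_ID`, the linear unmasking schedule, and the confidence-based selection rule are placeholders, not the released implementation.

```python
import torch

MASK_ID = 0  # hypothetical id of the [MASK] token in the shared vocabulary

@torch.no_grad()
def block_masked_diffusion_decode(model, prefix_ids, block_len=32, steps=8):
    """Decode one block of `block_len` tokens appended after `prefix_ids`.

    `model(ids)` is assumed to return (batch, seq, vocab) logits over the
    shared text+vision vocabulary; the schedule below is illustrative.
    """
    device = prefix_ids.device
    block = torch.full((1, block_len), MASK_ID, dtype=torch.long, device=device)
    ids = torch.cat([prefix_ids, block], dim=1)

    for step in range(steps):
        logits = model(ids)[:, -block_len:, :]    # predictions for the block
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)            # per-position confidence

        blk = ids[:, -block_len:]                 # view into the block
        still_masked = blk == MASK_ID
        # Linear schedule: after step s, (s+1)/steps of the block is revealed.
        target = (block_len * (step + 1)) // steps
        k = target - int((~still_masked).sum())
        if k > 0:
            conf = conf.masked_fill(~still_masked, -1.0)  # skip revealed slots
            top = conf.topk(k, dim=-1).indices
            blk.scatter_(1, top, pred.gather(1, top))     # commit best tokens
    return ids
```

The property this illustrates is parallel refinement: several tokens within a block are committed per forward pass, rather than one token per step as in autoregressive decoding, which is what the paper's prefix-aware optimizations then accelerate further.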
Why it matters
LLaDA2.0-Uni offers a scalable, unified framework for multimodal AI, natively supporting interleaved generation and reasoning. By matching specialized VLMs in understanding while delivering strong image generation and editing, it points to a promising paradigm for next-generation foundation models.
Original Abstract
We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Codes and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
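For intuition on the SigLIP-VQ discretization step described in the abstract, the following hypothetical sketch shows the basic vector-quantization mechanism: continuous patch embeddings are snapped to the nearest entry of a learned codebook, yielding integer ids the dLLM backbone can treat like text tokens. The shapes, the cosine-similarity lookup, and all names are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def quantize_visual_features(features, codebook):
    """Map continuous vision-encoder features to discrete token ids.

    features: (num_patches, d) patch embeddings from a vision encoder.
    codebook: (vocab_size, d) learned code vectors.
    Returns:  (num_patches,) integer ids usable alongside text tokens.
    """
    # L2-normalize so the nearest-neighbor lookup is by cosine similarity
    # (a common choice for semantic codebooks; assumed here).
    f = F.normalize(features, dim=-1)
    c = F.normalize(codebook, dim=-1)
    return (f @ c.t()).argmax(dim=-1)

# Usage with made-up sizes: a 16x16 patch grid and a 16k-entry codebook.
feats = torch.randn(256, 1152)
codes = torch.randn(16384, 1152)
token_ids = quantize_visual_features(feats, codes)
```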