ArXiv TLDR

Large Language Models are Universal Reasoners for Visual Generation

arXiv: 2605.04040

Sucheng Ren, Chen Chen, Zhenbang Wang, Liangchen Song, Xiangxin Zhu + 3 more

cs.CV

TLDR

UniReasoner uses LLMs as universal reasoners to close the understanding-generation gap in text-to-image models via self-critiqued visual drafts.

Key contributions

  • Identifies the "understanding-generation gap": unified LLM-based systems can accurately verify whether an image satisfies a prompt, yet fail to generate faithfully from that same prompt.
  • Proposes UniReasoner, in which an LLM first produces a coarse visual draft of the prompt as discrete vision tokens.
  • The LLM then self-critiques the draft for prompt consistency, producing a grounded textual evaluation that pinpoints what needs to be corrected.
  • A diffusion model generates the final image, conditioned jointly on the prompt, the visual draft, and the LLM's corrective evaluation (sketched below).
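
A minimal sketch of this three-stage pipeline, assuming hypothetical model interfaces (llm.draft, llm.critique, diffusion.generate) rather than the authors' actual API:

```python
# Sketch of the three-stage UniReasoner pipeline described above; the model
# interfaces (llm.draft, llm.critique, diffusion.generate) are placeholders,
# not the paper's actual API.

def unireasoner_generate(prompt, llm, diffusion):
    # 1) The LLM drafts a coarse visual layout as discrete vision tokens.
    draft_tokens = llm.draft(prompt)

    # 2) The LLM self-critiques the draft against the prompt, producing a
    #    grounded textual evaluation of omissions, hallucinations, and
    #    relational errors to correct.
    evaluation = llm.critique(prompt, draft_tokens)

    # 3) The diffusion model is conditioned jointly on the prompt, the draft,
    #    and the corrective evaluation.
    return diffusion.generate(
        prompt=prompt,
        visual_draft=draft_tokens,
        correction_text=evaluation,
    )
```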

Why it matters

UniReasoner improves text-to-image generation by turning LLM reasoning into explicit guidance that bridges the understanding-generation gap. It yields more faithful and compositionally aligned images from the same diffusion backbone while maintaining image quality, offering a practical way to leverage LLM reasoning for visual synthesis.

Original Abstract

Text-to-image generation has advanced rapidly with diffusion models, progressing from CLIP and T5 conditioning to unified systems where a single LLM backbone handles both visual understanding and generation. Despite the architectural unification, these systems frequently fail to faithfully align complex prompts during synthesis, even though they remain highly accurate at verifying whether an image satisfies those same prompts. We formalize this as the "understanding-generation gap" and propose UniReasoner, a framework that leverages the LLM as a universal reasoner to convert its understanding strength into direct generation guidance. Given a prompt, the LLM first produces a coarse visual draft composed of discrete vision tokens. It then performs a self-critique by evaluating the draft for prompt consistency, producing a grounded textual evaluation that pinpoints what needs to be corrected. Finally, a diffusion model is conditioned jointly on the prompt, the visual draft, and the evaluation, ensuring that generation is guided by explicit corrective signals. Each signal addresses a limitation of the other: the draft provides a concrete, scene-level anchor that reduces under-specification in text-only conditioning, while the evaluation turns verification into grounded, actionable constraints that correct omissions, hallucinations, and relational errors. Experiments show that UniReasoner improves compositional alignment and semantic faithfulness under the same diffusion backbone while maintaining image quality, demonstrating a practical way to exploit LLM reasoning to close the understanding-generation gap.
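
One plausible way to realize the joint conditioning the abstract describes is to concatenate the three signals into a single cross-attention context for the diffusion backbone. The sketch below is an illustrative assumption, not the paper's stated mechanism; the embedding shapes and fusion scheme are hypothetical.

```python
import torch

def build_joint_condition(prompt_emb, draft_emb, eval_emb):
    """Concatenate the three conditioning signals along the sequence axis.

    prompt_emb : (L_p, d) text-encoder embeddings of the prompt
    draft_emb  : (L_d, d) embeddings of the LLM's discrete vision-token draft
    eval_emb   : (L_e, d) embeddings of the self-critique evaluation text

    Simple concatenation is an assumption for illustration; the paper may
    fuse these signals differently (e.g. via adapters or separate attention).
    """
    return torch.cat([prompt_emb, draft_emb, eval_emb], dim=0)
```

Under this assumption, the concatenated sequence would serve as the key/value context for the diffusion model's cross-attention layers, so every denoising step can attend to the prompt, the draft anchor, and the corrective evaluation at once.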
