ArXiv TLDR

Render-in-the-Loop: Vector Graphics Generation via Visual Self-Feedback

arXiv: 2604.20730

Guotao Liang, Zhangcheng Wang, Juncheng Hu, Haitao Zhou, Ziteng Xue + 3 more

cs.CV

TLDR

Render-in-the-Loop reframes SVG generation as a step-wise process: an MLLM renders its intermediate code into a canvas and conditions each subsequent primitive on that visual self-feedback, improving both accuracy and efficiency.

Key contributions

  • Introduces Render-in-the-Loop, a paradigm for SVG generation using visual self-feedback from intermediate renders.
  • Employs Visual Self-Feedback (VSF) training to condition primitive generation on evolving visual states.
  • Proposes Render-and-Verify (RaV) inference to effectively filter degenerate and redundant SVG primitives (see the sketch after this list).
  • Achieves state-of-the-art performance on Text-to-SVG and Image-to-SVG benchmarks.
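
To make the loop concrete, here is a minimal Python sketch of step-wise generation with a Render-and-Verify style filter. This is an illustration under stated assumptions, not the paper's implementation: generate_primitive (the MLLM step) and render (partial SVG code to a raster canvas) are hypothetical callables, and the degeneracy/improvement checks are simplified stand-ins for the paper's verification criteria.

```python
# Hedged sketch of Render-in-the-Loop generation with a Render-and-Verify (RaV)
# style filter. All helper names are illustrative, not the paper's API.
from typing import Callable, List, Optional

import numpy as np


def render_in_the_loop(
    prompt: str,
    generate_primitive: Callable[[str, List[str], np.ndarray], str],  # MLLM step
    render: Callable[[List[str]], np.ndarray],   # partial SVG -> float canvas in [0, 1]
    target: Optional[np.ndarray] = None,         # target image for Image-to-SVG
    max_steps: int = 64,
) -> List[str]:
    """Generate SVG primitives one at a time, re-rendering the cumulative
    canvas after each step so the model conditions on what it has drawn."""
    primitives: List[str] = []
    canvas = render(primitives)  # blank canvas

    for _ in range(max_steps):
        # 1) Propose the next primitive, conditioned on the current canvas image.
        candidate = generate_primitive(prompt, primitives, canvas)
        if candidate == "<eos>":
            break

        # 2) Render-and-Verify: re-render with the candidate and keep it only if it
        #    is non-degenerate and does not worsen the match to the target image.
        new_canvas = render(primitives + [candidate])
        degenerate = np.allclose(new_canvas, canvas)  # no visible effect on the canvas
        worse = target is not None and (
            np.abs(new_canvas - target).mean() > np.abs(canvas - target).mean()
        )
        if degenerate or worse:
            continue  # drop the primitive; assumes stochastic decoding can resample

        # 3) Accept the primitive and feed the updated canvas into the next step.
        primitives.append(candidate)
        canvas = new_canvas

    return primitives
```

For Text-to-SVG there is no target image, so only the degeneracy check applies in this sketch; for Image-to-SVG a candidate is also rejected if it increases pixel error against the target.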

Why it matters

This paper addresses a critical limitation of MLLM-based SVG generation by integrating visual feedback into the generation loop, reframing SVG synthesis from a purely textual task into a visuo-spatial one. By enabling models to "see" their progress, it significantly improves accuracy and efficiency. This approach has broad implications for creative AI and design tools.

Original Abstract

Multimodal Large Language Models (MLLMs) have shown promising capabilities in generating Scalable Vector Graphics (SVG) via direct code synthesis. However, existing paradigms typically adopt an open-loop "blind drawing" approach, where models generate symbolic code sequences without perceiving intermediate visual outcomes. This methodology severely underutilizes the powerful visual priors embedded in MLLMs' vision encoders, treating SVG generation as a disjointed textual sequence modeling task rather than an integrated visuo-spatial one. Consequently, models struggle to reason about partial canvas states and implicit occlusion relationships, which are visually explicit but textually ambiguous. To bridge this gap, we propose Render-in-the-Loop, a novel generation paradigm that reformulates SVG synthesis as a step-wise, visual-context-aware process. By rendering intermediate code states into a cumulative canvas, the model explicitly observes the evolving visual context at each step, leveraging on-the-fly feedback to guide subsequent generation. However, we demonstrate that applying this visual loop naively to off-the-shelf models is suboptimal due to their inability to leverage incremental visual-code mappings. To address this, we first utilize fine-grained path decomposition to construct dense multi-step visual trajectories, and then introduce a Visual Self-Feedback (VSF) training strategy to condition the next primitive generation on intermediate visual states. Furthermore, a Render-and-Verify (RaV) inference mechanism is proposed to effectively filter degenerate and redundant primitives. Our framework, instantiated on a multimodal foundation model, outperforms strong open-weight baselines on the standard MMSVGBench. This result highlights the remarkable data efficiency and generalization capability of our Render-in-the-Loop paradigm for both Text-to-SVG and Image-to-SVG tasks.
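
As a rough illustration of the training side described in the abstract, the sketch below shows one way dense multi-step visual trajectories for VSF training could be assembled from a ground-truth SVG via path decomposition. The helpers split_into_paths and render_paths and the record layout are assumptions for illustration, not the authors' actual data pipeline.

```python
# Hedged sketch of building Visual Self-Feedback (VSF) training trajectories:
# decompose a ground-truth SVG into individual paths, render every cumulative
# prefix, and pair each intermediate canvas with the next primitive to predict.
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class VSFExample:
    prompt: str             # text prompt or target-image reference
    canvas: np.ndarray      # rendered cumulative canvas after i primitives
    context: List[str]      # primitives emitted so far (code context)
    next_primitive: str     # supervision target for the next generation step


def build_vsf_trajectory(
    prompt: str,
    svg_code: str,
    split_into_paths: Callable[[str], List[str]],     # fine-grained path decomposition
    render_paths: Callable[[List[str]], np.ndarray],  # partial SVG -> raster canvas
) -> List[VSFExample]:
    """Turn one ground-truth SVG into a dense multi-step trajectory of
    (intermediate canvas, next primitive) pairs for VSF training."""
    paths = split_into_paths(svg_code)
    examples: List[VSFExample] = []
    for i, nxt in enumerate(paths):
        # Render only the first i primitives: the "evolving visual state"
        # the model observes before emitting primitive i + 1.
        canvas = render_paths(paths[:i])
        examples.append(VSFExample(prompt, canvas, paths[:i], nxt))
    return examples
```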
