ArXiv TLDR

MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation

2604.15309

Yan Li, Zezi Zeng, Yifan Yang, Yuqing Yang, Ning Liao + 10 more

cs.CV, cs.AI, cs.CL

TLDR

MM-WebAgent is a hierarchical multimodal agent that generates coherent and visually consistent webpages by coordinating AIGC elements through planning and self-reflection.

Key contributions

  • Introduces MM-WebAgent, a hierarchical agent for multimodal webpage generation.
  • Coordinates AIGC elements using hierarchical planning and iterative self-reflection.
  • Jointly optimizes global layout, local content, and their integration for coherence.
  • Outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration.
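
The plan, generate, and reflect loop named in the contributions can be pictured with a minimal Python sketch. This is an illustration only, not the authors' implementation: every name here (`PagePlan`, `ElementSpec`, `plan_page`, `generate_element`, `reflect`, `build_page`) is hypothetical, the theme value is made up, and the planning, generation, and reflection steps are simple stubs standing in for model or AIGC tool calls.

```python
from dataclasses import dataclass, field


@dataclass
class ElementSpec:
    kind: str        # e.g. "image", "video", "chart"
    prompt: str      # what the element should depict
    style: str = ""  # style hint; empty until aligned with the global plan


@dataclass
class PagePlan:
    theme: str
    elements: list = field(default_factory=list)


def plan_page(request: str) -> PagePlan:
    """Global planning (stub): choose a page theme and the elements to create."""
    plan = PagePlan(theme="minimal-dark")  # hypothetical theme name
    plan.elements = [
        ElementSpec("image", f"hero image for {request}"),
        ElementSpec("chart", f"summary chart for {request}"),
    ]
    return plan


def generate_element(spec: ElementSpec) -> dict:
    """Local generation (stub): stands in for a call to an AIGC tool."""
    return {"kind": spec.kind, "prompt": spec.prompt, "style": spec.style}


def reflect(plan: PagePlan, assets: list) -> list:
    """Self-reflection (stub): indices of assets whose style differs from the theme."""
    return [i for i, asset in enumerate(assets) if asset["style"] != plan.theme]


def build_page(request: str, max_rounds: int = 3):
    """Plan globally, generate each element, then revise until consistent."""
    plan = plan_page(request)
    assets = [generate_element(spec) for spec in plan.elements]
    for _ in range(max_rounds):
        inconsistent = reflect(plan, assets)
        if not inconsistent:
            break
        for i in inconsistent:
            # Push the global theme down to the element and regenerate it.
            plan.elements[i].style = plan.theme
            assets[i] = generate_element(plan.elements[i])
    return plan, assets


plan, assets = build_page("a bakery landing page")
```

In this toy version the first generation pass ignores the theme, which mimics elements created in isolation; the reflection pass then flags the mismatch and regenerates those elements with the shared style.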

Why it matters

AIGC tools often produce stylistically inconsistent webpage elements when each element is generated in isolation. MM-WebAgent addresses this with hierarchical planning and iterative self-reflection, which keep the page globally coherent and visually consistent. This advances automated UI/UX design by enabling better-integrated, higher-quality multimodal content generation.

Original Abstract

The rapid progress of Artificial Intelligence Generated Content (AIGC) tools enables images, videos, and visualizations to be created on demand for webpage design, offering a flexible and increasingly adopted paradigm for modern UI/UX. However, directly integrating such tools into automated webpage generation often leads to style inconsistency and poor global coherence, as elements are generated in isolation. We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection. MM-WebAgent jointly optimizes global layout, local multimodal content, and their integration, producing coherent and visually consistent webpages. We further introduce a benchmark for multimodal webpage generation and a multi-level evaluation protocol for systematic assessment. Experiments demonstrate that MM-WebAgent outperforms code-generation and agent-based baselines, especially on multimodal element generation and integration. Code & Data: https://aka.ms/mm-webagent.
