Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

April 21, 2026 | arXiv:2604.19667

Yi Zhong, Buqiang Xu, Yijun Wang, Zifei Shan, Shuofei Qiao + 2 more

cs.CL, cs.AI, cs.CV, cs.LG, cs.MA

TLDR

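Introduces Chat2Workflow, a benchmark for generating executable visual workflows from natural language, together with an agentic framework that mitigates recurrent execution errors. Results show that current LLMs still struggle to produce correct, stable, and executable workflows.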

Key contributions

Introduces Chat2Workflow, a benchmark for generating executable visual workflows from natural language.
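Builds the benchmark from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and deployed directly to platforms such as Dify and Coze.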
Proposes an agentic framework that mitigates recurrent execution errors in generated workflows, yielding up to 5.34% resolve rate gains.
Shows that state-of-the-art LLMs often capture high-level intent but struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements.

Why it matters

Visual workflows are currently built almost entirely by hand, which is costly, time-consuming, and error-prone. This paper tests whether LLMs can automate that process, and its benchmark and agentic framework expose how far current models are from generating robust, executable workflows. The remaining gap positions Chat2Workflow as a foundation for advancing industrial-grade automation.

Original Abstract

At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve, making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic framework to mitigate recurrent execution errors. Chat2Workflow is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements. Although our agentic framework yields up to 5.34% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.
