DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios

April 28, 2026 | arXiv:2604.25914

Jinxiang Meng, Shaoping Huang, Fangyu Lei, Jingyu Guo, Haoxiang Liu + 15 more

cs.CL

TLDR

DV-World is a new benchmark for evaluating data visualization agents in real-world scenarios, addressing limitations of existing code-sandbox approaches.

Key contributions

Introduces DV-World, a benchmark with 260 tasks for real-world data visualization agents.
Covers spreadsheet manipulation, visual artifact adaptation, and proactive intent alignment across diverse platforms.
Employs a hybrid evaluation framework combining numerical precision and MLLM-as-a-Judge for semantic-visual assessment.
Shows that state-of-the-art models achieve less than 50% overall performance, exposing critical deficits in real-world data visualization.
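
To make the hybrid evaluation idea concrete, here is a minimal sketch of how a numeric table-value check might be blended with a rubric score from a judge model. It is an illustration only and not the benchmark's actual scoring code: the function names `table_value_alignment` and `hybrid_score`, the tolerance, the equal weighting, and the dictionary layout keyed by (series, category) are all assumptions, and the rubric score is taken as a given number in [0, 1] rather than produced by a real MLLM call.

```python
from math import isclose


def table_value_alignment(expected, plotted, rel_tol=1e-6):
    """Return the fraction of expected cells whose plotted value matches.

    Both arguments map a (series, category) key to a number. This layout
    is a hypothetical stand-in for values extracted from a chart.
    """
    if not expected:
        return 0.0
    hits = 0
    for key, want in expected.items():
        got = plotted.get(key)
        # A missing cell or a value outside the tolerance counts as a miss.
        if got is not None and isclose(got, want, rel_tol=rel_tol, abs_tol=1e-9):
            hits += 1
    return hits / len(expected)


def hybrid_score(alignment, rubric_score, weight=0.5):
    """Blend the numeric score with a judge rubric score, both in [0, 1].

    The equal default weighting is an assumption made for illustration.
    """
    for value in (alignment, rubric_score, weight):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores and weight must lie in [0, 1]")
    return weight * alignment + (1.0 - weight) * rubric_score


# Hypothetical source table and the values read back from a generated chart.
expected = {
    ("Revenue", "Q1"): 120.0,
    ("Revenue", "Q2"): 135.5,
    ("Cost", "Q1"): 80.0,
    ("Cost", "Q2"): 91.25,
}
plotted = {
    ("Revenue", "Q1"): 120.0,
    ("Revenue", "Q2"): 135.5,
    ("Cost", "Q1"): 82.0,  # wrong value; ("Cost", "Q2") is missing entirely
}

alignment = table_value_alignment(expected, plotted)
overall = hybrid_score(alignment, rubric_score=0.8)
```

Keeping the two signals separate until the final blend means a chart that looks right but plots wrong numbers (or the reverse) is still visible in the component scores.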

Why it matters

This paper introduces DV-World, a benchmark that addresses gaps in how data visualization agents are evaluated in realistic, complex scenarios. By moving beyond confined code sandboxes, single-language creation-only tasks, and the assumption of perfectly specified intent, it provides a realistic testbed for the versatile expertise that enterprise workflows require. With state-of-the-art models scoring below 50% overall, the results show where current systems fall short and point to directions for future research.

Original Abstract

Real-world data visualization (DV) requires native environmental grounding, cross-platform evolution, and proactive intent alignment. Yet, existing benchmarks often suffer from code-sandbox confinement, single-language creation-only tasks, and assumption of perfect intent. To bridge these gaps, we introduce DV-World, a benchmark of 260 tasks designed to evaluate DV agents across real-world professional lifecycles. DV-World spans three domains: DV-Sheet for native spreadsheet manipulation including chart and dashboard creation as well as diagnostic repair; DV-Evolution for adapting and restructuring reference visual artifacts to fit new data across diverse programming paradigms and DV-Interact for proactive intent alignment with a user simulator that mimics real-world ambiguous requirements. Our hybrid evaluation framework integrates Table-value Alignment for numerical precision and MLLM-as-a-Judge with rubrics for semantic-visual assessment. Experiments reveal that state-of-the-art models achieve less than 50% overall performance, exposing critical deficits in handling the complex challenges of real-world data visualization. DV-World provides a realistic testbed to steer development toward the versatile expertise required in enterprise workflows. Our data and code are available at https://github.com/DA-Open/DV-World.


