ArXiv TLDR

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

arXiv: 2605.12501

Miaosen Zhang, Xiaohan Zhao, Zhihong Tan, Zhou Huoshen, Yijia Fan + 12 more

cs.CV

TLDR

This paper introduces CUActSpot, a benchmark for complex, multi-modal interactions, together with a renderer-based data-synthesis pipeline that improves computer-use agents' reliability on diverse, low-frequency interactions.

Key contributions

  • Identifies data scarcity for complex, low-frequency GUI interactions as a key limitation for computer-use agents.
  • Introduces CUActSpot, a benchmark covering 5 modalities and diverse actions beyond click-centric GUI interactions.
  • Develops a renderer-based data synthesis pipeline to generate diverse scenes, instructions, and action traces.
  • Achieves state-of-the-art results with Phi-Ground-Any-4B, which outperforms open-source models with fewer than 32B parameters.
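The renderer-based pipeline in the third bullet can be sketched roughly as follows. This is a minimal illustrative mock, not the authors' released code: every function, field name, and modality structure here is an assumption made for clarity, and the "renderer" and "LLM" steps are replaced by stand-ins that only show the data flow (scene → recorded coordinates → instruction/action trace).

```python
import json
import random

# Hypothetical sketch of the paper's renderer-based synthesis loop:
# generate a scene per modality, record element coordinates, then pair
# each element with an instruction and an action trace. All names and
# record layouts are illustrative assumptions.

MODALITIES = ["gui", "text", "table", "canvas", "natural_image"]

def render_scene(modality, rng):
    """Stand-in renderer: a real pipeline would rasterize the scene to a
    screenshot; here we only fabricate element bounding boxes."""
    elements = [
        {"id": f"{modality}-el{i}",
         "bbox": [rng.randint(0, 800), rng.randint(0, 600), 40, 20]}
        for i in range(rng.randint(1, 3))
    ]
    return {"screenshot": f"{modality}.png", "elements": elements}

def annotate(scene):
    """Stand-in for the LLM step: emit an instruction plus an action
    trace grounded in the recorded coordinates."""
    traces = []
    for el in scene["elements"]:
        x, y, w, h = el["bbox"]
        traces.append({
            "instruction": f"Click the element {el['id']}",
            "actions": [{"type": "click", "x": x + w // 2, "y": y + h // 2}],
        })
    return traces

def synthesize(seed=0):
    """Run the loop over all modalities and collect training samples."""
    rng = random.Random(seed)
    corpus = []
    for modality in MODALITIES:
        scene = render_scene(modality, rng)
        for sample in annotate(scene):
            corpus.append({"modality": modality, **sample})
    return corpus

if __name__ == "__main__":
    print(json.dumps(synthesize()[0], indent=2))
```

Seeding the generator makes each synthesized corpus reproducible; in a real pipeline the LLM annotation step would also verify that each trace is executable against the rendered scene.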

Why it matters

This work addresses a critical bottleneck in computer-use agents: their poor reliability on complex, less common interactions. By providing a comprehensive benchmark and a scalable data synthesis method, it paves the way for more robust and trustworthy automation tools.

Original Abstract

Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models' capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git
