Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
Miaosen Zhang, Xiaohan Zhao, Zhihong Tan, Zhou Huoshen, Yijia Fan + 12 more
TLDR
This paper introduces CUActSpot, a benchmark for complex, diverse computer-use interactions, together with a renderer-based data synthesis pipeline that improves agents' reliability on them.
Key contributions
- Identifies data scarcity for complex, low-frequency GUI interactions as a key limitation for computer-use agents.
- Introduces CUActSpot, a benchmark spanning five modalities (GUI, text, table, canvas, natural image) and actions such as click, drag, and draw, going beyond prior click-centric GUI benchmarks (a schema sketch follows this list).
- Develops a renderer-based data synthesis pipeline to generate diverse scenes, instructions, and action traces.
- Shows that, after training on the synthesized corpus, the 4B-parameter Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters.
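To make the broader action space concrete, here is a minimal, hypothetical sketch of what an action trace covering clicks, drags, and freehand draws could look like. The class and field names (Action, Trace, kind, points) are illustrative assumptions, not the benchmark's actual schema.

```python
# Minimal sketch of an action-trace representation (illustrative only; the
# digest does not specify the benchmark's actual format).
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]  # normalized (x, y) screen coordinates in [0, 1]

@dataclass
class Action:
    kind: str                                            # "click", "drag", "draw", "type", ...
    points: List[Point] = field(default_factory=list)    # 1 point for click, 2 for drag, many for draw
    text: str = ""                                       # payload for typing actions

@dataclass
class Trace:
    instruction: str            # natural-language instruction paired with the scene
    screenshot_path: str        # rendered scene the agent sees
    actions: List[Action] = field(default_factory=list)

# Example: a drag selection inside a rendered table scene
trace = Trace(
    instruction="Select the range from the 'Q1' cell to the 'Q4' cell.",
    screenshot_path="scenes/table_0042.png",
    actions=[Action(kind="drag", points=[(0.21, 0.35), (0.21, 0.62)])],
)
```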
Why it matters
This work addresses a critical bottleneck in computer-use agents: their poor reliability on complex, less common interactions. By providing a comprehensive benchmark and a scalable data synthesis method, it paves the way for more robust and trustworthy automation tools.
Original Abstract
Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models' capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git
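The abstract outlines a three-stage synthesis loop: render a scene for a modality, record the screenshot and element coordinates, then have an LLM produce a matching instruction and action trace. The sketch below illustrates that loop under stated assumptions; the toy renderer and toy annotator stand in for the real components, which this digest does not describe.

```python
# Illustrative sketch of the renderer-based synthesis loop described in the
# abstract: generate a scene, record element coordinates, then produce a
# matching instruction and ground-truth action trace. The renderer and the
# "LLM" step here are toy placeholders, not the released pipeline.
import random

MODALITIES = ["gui", "text", "table", "canvas", "natural_image"]

def render_scene(modality, rng):
    """Toy stand-in for a per-modality renderer: emits labeled elements with
    pixel bounding boxes (x, y, w, h), as if recorded from a real render."""
    elements = [
        {"id": i, "label": f"{modality}_element_{i}",
         "bbox": (rng.randint(0, 1200), rng.randint(0, 700), 80, 24)}
        for i in range(rng.randint(3, 8))
    ]
    return f"scenes/{modality}_{rng.randint(0, 9999):04d}.png", elements

def annotate(elements, rng):
    """Toy stand-in for the LLM step: pick a target element and emit a
    matching instruction plus the action trace that fulfills it."""
    target = rng.choice(elements)
    x, y, w, h = target["bbox"]
    return {
        "instruction": f"Click on '{target['label']}'.",
        "actions": [{"kind": "click", "point": (x + w / 2, y + h / 2)}],
    }

def synthesize_example(rng):
    modality = rng.choice(MODALITIES)
    screenshot, elements = render_scene(modality, rng)
    sample = {"modality": modality, "screenshot": screenshot, "elements": elements}
    sample.update(annotate(elements, rng))
    return sample

# Small toy corpus; the real pipeline would scale this across many seeds.
corpus = [synthesize_example(random.Random(i)) for i in range(5)]
```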