ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
Xuhao Hu, Xi Zhang, Haiyang Xu, Kyle Qiao, Jingyi Yang, et al.
TLDR
ToolCUA enables Computer Use Agents to optimally orchestrate GUI actions and high-level tools using a staged training paradigm, achieving new SOTA.
Key contributions
- Introduces a pipeline to synthesize diverse GUI-Tool trajectories from existing GUI data.
- Employs Tool-Bootstrapped GUI RFT to improve GUI-Tool switching decisions at critical points.
- Optimizes with Online Agentic RL and a Tool-Efficient Path Reward in a high-fidelity environment.
- Achieves 46.85% accuracy on OSWorld-MCP, a relative improvement of roughly 66% over the baseline.
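The Tool-Efficient Path Reward is described only at a high level here; a minimal sketch of such a trajectory-level reward, where all names, weights, and caps are illustrative assumptions rather than the paper's actual formulation:

```python
def path_reward(success: bool, num_steps: int, num_tool_calls: int,
                tool_bonus: float = 0.1, step_penalty: float = 0.01,
                max_rewarded_tools: int = 3) -> float:
    """Hypothetical trajectory-level reward: task success dominates,
    appropriate tool use earns a small capped bonus, and longer
    execution paths are penalized."""
    r = 1.0 if success else 0.0
    r += tool_bonus * min(num_tool_calls, max_rewarded_tools)  # encourage tool use, capped
    r -= step_penalty * num_steps                              # prefer shorter paths
    return r
```

Under this shape, a successful 10-step trajectory with one tool call scores higher than a 20-step one, matching the stated goal of encouraging appropriate tool use and shorter paths.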
Why it matters
This paper tackles the critical problem of hybrid action space orchestration for Computer Use Agents. ToolCUA's novel staged training paradigm significantly boosts agent performance, setting a new state of the art. This advancement is crucial for developing more efficient and robust real-world digital agents.
Original Abstract
Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with GUI actions or switch to tools, leading to suboptimal execution paths. This difficulty stems from the scarcity of high-quality interleaved GUI-Tool trajectories, the cost and brittleness of collecting real tool trajectories, and the lack of trajectory-level supervision for GUI-Tool path selection. In this paper, we propose ToolCUA, an end-to-end agent designed to learn optimal GUI-Tool path selection through a staged training paradigm. We first introduce an Interleaved GUI-Tool Trajectory Scaling Pipeline that repurposes abundant static GUI trajectories and synthesizes a grounded tool library, enabling diverse GUI-Tool trajectories without manual engineering or real tool-trajectory collection. We then perform Tool-Bootstrapped GUI RFT, combining warmup SFT with single-turn RL to improve decisions at critical GUI-Tool switching points. Finally, we optimize ToolCUA with Online Agentic RL in a high-fidelity GUI-Tool environment, guided by a Tool-Efficient Path Reward that encourages appropriate tool use and shorter execution paths. Experiments on OSWorld-MCP show that ToolCUA achieves 46.85% accuracy, a relative improvement of approximately 66% over the baseline, establishing a new state of the art among models of comparable scale. It also improves by 3.9% over GUI-only settings, demonstrating effective GUI-Tool orchestration. The results further suggest that training in a hybrid action space is a promising paradigm for real-world digital agents. Open-sourced here: https://x-plug.github.io/ToolCUA/
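The hybrid action space the abstract describes, atomic GUI primitives alongside high-level tool calls, can be sketched as two action types behind one interface. This is a hypothetical illustration; the paper's actual action schema is not shown here:

```python
from dataclasses import dataclass, field
from typing import Union

@dataclass
class GuiAction:
    """Atomic GUI primitive, e.g. click or type."""
    kind: str            # e.g. "click", "type"
    x: int = 0
    y: int = 0
    text: str = ""

@dataclass
class ToolCall:
    """High-level tool invocation, e.g. an API-based file operation."""
    name: str
    args: dict = field(default_factory=dict)

# A hybrid-action agent emits either type at every step; the learning
# problem is choosing *which* at critical switching points.
Action = Union[GuiAction, ToolCall]

def describe(action: Action) -> str:
    """Serialize either action type for trajectory logging/replay."""
    if isinstance(action, GuiAction):
        return f"gui:{action.kind}@({action.x},{action.y})"
    return f"tool:{action.name}({action.args})"
```

A trajectory is then just a sequence of `Action`s, which is the level at which the paper's Tool-Efficient Path Reward supervises the agent.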