ArXiv TLDR

See, Point, Refine: Multi-Turn Approach to GUI Grounding with Visual Feedback

2604.13019

Himangi Mittal, Gaurav Mittal, Nelson Daniel Troncoso, Yu Hu

cs.CV

TLDR

This paper introduces a multi-turn, visual-feedback approach to precise GUI grounding in dense coding interfaces, significantly outperforming single-shot methods.

Key contributions

  • Addresses pixel-precise GUI grounding in dense coding environments, a challenge for Computer Use Agents.
  • Proposes a multi-turn iterative refinement process using visual feedback for error correction.
  • Employs a closed-loop grounding mechanism to self-correct displacement errors and adapt to dynamic UI changes.
  • Achieves significant improvements over single-shot models in click precision and overall task success on coding benchmarks.
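The closed-loop mechanism the contributions describe can be illustrated with a minimal sketch. The paper's actual model and prompting are not shown here; `predict_click` below is a hypothetical stand-in for a vision-language grounding call, implemented as a toy stub that, like the agent, produces an imperfect single-shot guess and then corrects itself when shown visual feedback from the previous attempt:

```python
# Hedged sketch of the "see, point, refine" loop: predict a click, observe the
# result, and refine until the click lands within a pixel tolerance.
# `predict_click` is a hypothetical stand-in for a VLM grounding call.

def predict_click(target, prev_click=None):
    """Toy grounding model: the first (single-shot) guess is offset from the
    target; with feedback on the previous click, it halves the remaining error,
    mimicking iterative self-correction from visual feedback."""
    if prev_click is None:
        return (target[0] + 12, target[1] - 9)  # initial single-shot guess
    return tuple(p + (t - p) * 0.5 for p, t in zip(prev_click, target))

def refine_click(target, tolerance=2.0, max_turns=8):
    """Closed-loop grounding: iterate until the click is within tolerance
    of the target, or the turn budget runs out."""
    click = predict_click(target)
    for turn in range(1, max_turns + 1):
        err = ((click[0] - target[0]) ** 2 + (click[1] - target[1]) ** 2) ** 0.5
        if err <= tolerance:
            return click, turn  # converged: precise click achieved
        # "See" the displacement from the last attempt and refine the point.
        click = predict_click(target, prev_click=click)
    return click, max_turns
```

In this toy setting the single-shot guess misses by 15 px, while the multi-turn loop converges to within 2 px in a few refinement steps, mirroring the paper's claim that feedback-driven refinement recovers from displacement errors that one-shot prediction cannot.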

Why it matters

This paper significantly advances GUI grounding for Computer Use Agents, especially in complex coding interfaces requiring sub-pixel accuracy. Its multi-turn refinement approach makes agents more reliable and adaptable, paving the way for next-generation software engineering tools.

Original Abstract

Computer Use Agents (CUAs) fundamentally rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions, but editing-level grounding in dense coding interfaces, where sub-pixel accuracy is required to interact with dense IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which lacks a mechanism for error correction and often fails in high-density interfaces. In this technical report, we conduct an empirical study of pixel-precise cursor localization in coding environments. Instead of a single-step execution, our agent engages in an iterative refinement process, utilizing visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our approach across GPT-5.4, Claude, and Qwen on a suite of complex coding benchmarks, demonstrating that multi-turn refinement significantly outperforms state-of-the-art single-shot models in both click precision and overall task success rate. Our results suggest that iterative visual reasoning is a critical component for the next generation of reliable software engineering agents. Code: https://github.com/microsoft/precision-cua-bench.
