ArXiv TLDR

UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

arXiv:2604.14113

Fei Tang, Bofan Chen, Zhengxi Lu, Tongbo Chen, Songqin Nong + 6 more

cs.CV, cs.AI, cs.CL

TLDR

UI-Zoomer adaptively zooms into GUI elements based on prediction uncertainty, improving localization for small icons and dense layouts without retraining.

Key contributions

  • Introduces UI-Zoomer, a training-free adaptive zoom-in framework for GUI grounding.
  • Uses a confidence-aware gate to selectively trigger zoom-in when localization is uncertain (see the sketch after this list).
  • Derives per-instance crop radii using an uncertainty-driven crop sizing module.
  • Achieves gains of up to +13.4% on ScreenSpot-Pro, +10.3% on UI-Vision, and +4.2% on ScreenSpot-v2, with no additional training.
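
The digest does not include code, but the gating idea lends itself to a short illustration. Below is a minimal sketch, assuming K candidate boxes sampled stochastically from the grounding model plus a token-level generation confidence for the coordinate tokens; the function names, the consensus measure, and the thresholds tau_spatial / tau_token are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def spatial_consensus(boxes: np.ndarray) -> float:
    """Agreement among K stochastic candidate boxes (x1, y1, x2, y2).

    Measured here as the spread of box centers normalized by the mean box
    diagonal; mean pairwise IoU would be an equally plausible choice.
    """
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0                  # (K, 2) box centers
    spread = float(np.sqrt(centers.var(axis=0).sum()))             # std of the centers
    diag = float(np.linalg.norm(boxes[:, 2:] - boxes[:, :2], axis=1).mean())
    return 1.0 / (1.0 + spread / max(diag, 1e-6))                  # in (0, 1]

def should_zoom(boxes: np.ndarray, token_confidences: np.ndarray,
                tau_spatial: float = 0.7, tau_token: float = 0.8) -> bool:
    """Confidence-aware gate: trigger zoom-in only when localization is uncertain.

    Fuses spatial consensus across the sampled boxes with the mean token-level
    generation confidence of the predicted coordinates.
    """
    consensus = spatial_consensus(boxes)
    token_conf = float(token_confidences.mean())
    # Zoom in if either signal suggests the model is unsure about this instance.
    return consensus < tau_spatial or token_conf < tau_token
```

A gate of this kind keeps the cheap single-pass prediction for confident cases and spends the extra zoomed-in inference only on uncertain ones.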

Why it matters

This paper tackles the challenge of localizing small and dense GUI elements, a common issue in interface understanding. By adaptively zooming based on model uncertainty, UI-Zoomer significantly boosts accuracy without requiring any model retraining. This makes existing GUI grounding models more robust and practical.

Original Abstract

GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose UI-Zoomer, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.
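
The crop-sizing step is described only at the level of the law of total variance. As a rough reading, the total localization variance splits into the variance of box centers across the stochastic samples (inter-sample positional spread) plus the average squared half-extent of the boxes (intra-sample extent), and the crop radius scales with the square root of that total. The sketch below illustrates that reading under stated assumptions; the scale factor k, the floor r_min, and the pooling of x and y are not taken from the paper.

```python
import numpy as np

def crop_radius(boxes: np.ndarray, k: float = 3.0, r_min: float = 64.0) -> float:
    """Per-instance crop radius from a total-variance style decomposition.

    boxes: (K, 4) stochastic candidates as (x1, y1, x2, y2) in pixels.
    total variance ~= Var(centers across samples)        # inter-sample positional spread
                    + E[(extent / 2)^2 across samples]   # intra-sample box extent
    """
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0          # (K, 2) predicted centers
    extents = boxes[:, 2:] - boxes[:, :2]                  # (K, 2) widths and heights

    inter = centers.var(axis=0)                            # positional spread term
    intra = ((extents / 2.0) ** 2).mean(axis=0)            # extent term (variance proxy)

    total_var = float((inter + intra).mean())              # pool x and y into one scalar
    return max(r_min, k * float(np.sqrt(total_var)))

# Illustrative use: crop a square of side 2 * radius around the mean predicted
# center and re-run the grounding model on the up-scaled crop.
```

Under this reading, tightly clustered, small-extent predictions get small crops (strong zoom), while diffuse predictions get larger, safer crops.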
