Proactive Detection of GUI Defects in Multi-Window Scenarios via Multimodal Reasoning
Xinyao Zhang, Rui Wang, Jinhao Cui, Haotian Huang, Wei Xue + 3 more
TLDR
This paper introduces a proactive framework using multimodal LLMs to detect GUI display defects in multi-window mobile scenarios, outperforming existing methods.
Key contributions
- Proactively triggers multi-window states (split-screen, foldable, window transitions) during app exploration to expose defects early.
- Uses Set-of-Mark (SoM) and multimodal LLMs with chain-of-thought for defect analysis.
- Constructs a benchmark of 50 real-world Android apps for multi-window GUI defects.
- Outperforms OwlEye and YOLO-based baselines: detects 40 defect-prone apps with a 10.00% false positive rate and an 11.11% false negative rate, and reaches an 87.2% F1 score for widget occlusion.
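The Set-of-Mark idea in the contributions above, pairing each widget with a numeric mark that the model can cite, can be illustrated with a minimal sketch. This is not the paper's implementation: the `Widget` type, the reading-order numbering, and the prompt wording are all assumptions made for illustration, and the sketch covers only the text side (a real SoM pipeline also draws the marks onto the screenshot).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Widget:
    """Hypothetical widget record: an identifier plus pixel bounds."""
    widget_id: str
    left: int
    top: int
    right: int
    bottom: int


def assign_marks(widgets):
    """Number widgets in reading order (top-to-bottom, then left-to-right)."""
    ordered = sorted(widgets, key=lambda w: (w.top, w.left))
    return {mark: w for mark, w in enumerate(ordered, start=1)}


def build_prompt(marks, window_mode):
    """Build an illustrative chain-of-thought prompt that refers to widgets by mark."""
    lines = [f"Window mode: {window_mode}", "Marked widgets:"]
    for mark, w in marks.items():
        lines.append(
            f"[{mark}] {w.widget_id} bounds=({w.left},{w.top},{w.right},{w.bottom})"
        )
    lines.append(
        "Think step by step, then report any display defect "
        "and the mark of each affected widget."
    )
    return "\n".join(lines)


if __name__ == "__main__":
    widgets = [
        Widget("btn", 0, 100, 50, 150),
        Widget("title", 0, 0, 200, 40),
    ]
    print(build_prompt(assign_marks(widgets), "split-screen"))
```

Because the marks are stable integers, a model's answer such as "widget 2 is truncated" can be mapped back to a concrete widget for localization.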
Why it matters
Multi-window mobile interfaces are becoming common, but existing defect detection tools are largely passive and built for conventional full-screen layouts. This research offers a proactive alternative that deliberately triggers multi-window states during exploration and uses multimodal LLMs to detect, localize, and explain the display defects that appear there.
Original Abstract
Multi-window mobile scenarios, such as split-screen and foldable modes, make GUI display defects more likely by forcing applications to adapt to changing window sizes and dynamic layout reflow. Existing detection techniques are limited in two ways: they are largely passive, analyzing screenshots only after problematic states have been reached, and they are mainly designed for conventional full-screen interfaces, making them less effective in multi-window settings.
We propose an end-to-end framework for GUI display defect detection in multi-window mobile scenarios. The framework proactively triggers split-screen, foldable, and window-transition states during app exploration, uses Set-of-Mark (SoM) to align screenshots with widget-level interface elements, and leverages multimodal large language models with chain-of-thought prompting to detect, localize, and explain display defects. We also construct a benchmark of GUI display defects using 50 real-world Android applications.
Experimental results show that multi-window settings substantially increase the exposure of layout-related defects, with text truncation increasing by 184% compared with conventional full-screen settings. At the application level, our method detects 40 defect-prone apps with a false positive rate of 10.00% and a false negative rate of 11.11%, outperforming OwlEye and YOLO-based baselines. At the fine-grained level, it achieves the best F1 score of 87.2% for widget occlusion detection.
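For readers less familiar with the metrics quoted in the abstract, the sketch below shows the standard definitions of false positive rate, false negative rate, F1 score, and percentage increase. The function names are illustrative, and the counts in the usage section are made-up values, not the paper's confusion matrix, which the abstract does not report.

```python
def false_positive_rate(fp, tn):
    """FPR = FP / (FP + TN): share of defect-free cases flagged as defective."""
    return fp / (fp + tn)


def false_negative_rate(fn, tp):
    """FNR = FN / (FN + TP): share of defective cases that were missed."""
    return fn / (fn + tp)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when either is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def percent_increase(baseline, observed):
    """Relative increase of `observed` over `baseline`, in percent."""
    return (observed - baseline) / baseline * 100


if __name__ == "__main__":
    # Hypothetical counts chosen only to exercise the formulas.
    print(false_positive_rate(fp=2, tn=8))
    print(false_negative_rate(fn=1, tp=8))
    print(f1_score(tp=8, fp=2, fn=2))
    print(percent_increase(baseline=100, observed=284))
```

Under these definitions, a 184% increase means the multi-window count is 2.84 times the full-screen count, not 1.84 times.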