ArXiv TLDR

UIGaze: How Closely Can VLMs Approximate Human Visual Attention on User Interfaces?

arXiv: 2604.26352

Min Song, Yoonseong Lee, Yeonhu Seo

cs.HC

TLDR

UIGaze explores how well VLMs predict human visual attention on UIs using eye-tracking, finding moderate alignment that varies by UI type and viewing duration.

Key contributions

  • Investigates how closely VLMs approximate human visual attention on user interfaces.
  • Utilizes the UEyes dataset of 1,980 UI screenshots with real eye-tracking data from 62 participants.
  • Evaluates nine state-of-the-art VLMs using a zero-shot coordinate prediction pipeline (sketched below).
  • Finds VLMs achieve moderate alignment with human gaze, improving with longer viewing durations.
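The coordinate prediction step can be pictured with a minimal sketch. It assumes a hypothetical query_vlm(image_path, prompt) helper standing in for whichever VLM API is used (not part of the paper's released code), and the prompt wording and parsing rules here are illustrative guesses, not the study's actual protocol.

```python
# Minimal sketch of a zero-shot gaze-coordinate prediction loop.
# query_vlm is a hypothetical helper that sends an image plus a text prompt
# to a vision-language model and returns its raw text reply.
import re

PROMPT = (
    "Look at this UI screenshot and list the points a person would look at "
    "in the first few seconds, as (x, y) pixel coordinates, one per line."
)

def predict_gaze_points(image_path, query_vlm):
    """Ask a VLM for gaze coordinates and parse (x, y) pairs from its reply."""
    reply = query_vlm(image_path, PROMPT)
    points = []
    # Extract every "(x, y)" integer pair from the free-form model output.
    for match in re.finditer(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)", reply):
        points.append((int(match.group(1)), int(match.group(2))))
    return points
```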

Why it matters

This paper is crucial for understanding VLMs' potential in UI/UX design and accessibility. By quantifying how well VLMs predict human gaze, it opens avenues for automated UI evaluation and intelligent design tools. It highlights current limitations and future research directions for more human-centric AI.

Original Abstract

Vision Language Models (VLMs) have demonstrated strong capabilities in understanding visual content, yet their ability to predict where humans look on user interfaces remains unexplored. We present UIGaze, a study investigating how closely VLMs can approximate human visual attention on user interfaces using real eye-tracking data. Using the UEyes dataset - comprising 1,980 UI screenshots across four categories (webpage, desktop, mobile, poster) with eye-tracking data from 62 participants - we evaluate nine state-of-the-art VLMs through a zero-shot coordinate prediction pipeline. Each model generates gaze point coordinates that are converted into saliency maps via Gaussian blurring and compared against ground truth using CC, SIM, and KL divergence. Our experiments (1,980 images x 9 models x 3 runs x 3 durations) reveal that VLMs achieve moderate alignment with human gaze patterns, with the degree of alignment varying significantly across UI types and improving with longer viewing durations - suggesting VLMs capture exploratory gaze patterns rather than initial fixations. All code, predictions, and evaluation results are publicly available.
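As a rough illustration of the evaluation pipeline described in the abstract (not the authors' released code), the sketch below rasterizes predicted gaze points into a saliency map via Gaussian blurring and scores it against a ground-truth map with CC, SIM, and KL divergence. The blur sigma and normalization details are assumptions.

```python
# Minimal sketch: gaze points -> Gaussian-blurred saliency map -> CC/SIM/KL.
import numpy as np
from scipy.ndimage import gaussian_filter

def points_to_saliency(points, height, width, sigma=30.0):
    """Rasterize (x, y) gaze points and blur them into a normalized saliency map."""
    sal = np.zeros((height, width), dtype=np.float64)
    for x, y in points:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            sal[yi, xi] += 1.0
    sal = gaussian_filter(sal, sigma=sigma)
    return sal / sal.sum() if sal.sum() > 0 else sal

def cc(pred, gt):
    """Pearson correlation coefficient between two saliency maps."""
    return float(np.corrcoef(pred.ravel(), gt.ravel())[0, 1])

def sim(pred, gt):
    """Histogram intersection (SIM) of the two maps, each normalized to sum to 1."""
    p = pred / pred.sum()
    g = gt / gt.sum()
    return float(np.minimum(p, g).sum())

def kl_div(pred, gt, eps=1e-12):
    """KL divergence of the ground truth from the prediction (lower is better)."""
    p = pred / pred.sum() + eps
    g = gt / gt.sum() + eps
    return float((g * np.log(g / p)).sum())
```

In the study, scoring like this is repeated over 1,980 images x 9 models x 3 runs x 3 viewing durations; the snippet only shows the per-image comparison.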
