ArXiv TLDR

Beyond Screenshots: Evaluating VLMs' Understanding of UI Animations

arXiv:2604.26148

Chen Liang, Xirui Jiang, Naihao Deng, Eytan Adar, Anhong Guo

cs.HC, cs.CL

TLDR

This paper evaluates VLMs' understanding of UI animations using a new dataset, AniMINT, finding that they reliably detect primitive motion but struggle with high-level interpretation.

Key contributions

  • Highlights a gap in VLM UI-understanding research, which has focused primarily on static screenshots.
  • Introduces AniMINT, a novel dataset of 300 densely annotated UI animation videos.
  • Evaluates state-of-the-art VLMs on their ability to perceive animation effects, identify animation purposes, and interpret animation meaning.
  • Finds that VLMs reliably detect primitive motion but interpret high-level animations inconsistently, lagging human performance.
  • Uses Motion, Context, and Perceptual Cues (MCPC) to probe factors affecting VLM performance, revealing key bottlenecks.

Why it matters

Understanding UI animations is crucial for AI agents to reliably interact with modern interfaces, which increasingly use animations for functional purposes. This work reveals current VLMs' limitations in high-level animation interpretation, guiding future research to build more capable and robust UI agents.

Original Abstract

AI agents operating on user interfaces must understand how interfaces communicate state and feedback to act reliably. As a core communicative modality, animations are increasingly used in modern interfaces, serving critical functional purposes beyond mere aesthetics. Thus, understanding UI animation is essential for comprehensive interface interpretation. However, recent studies of Vision Language Models (VLMs) for UI understanding have focused primarily on static screenshots, leaving it unclear how well these models handle dynamic UI animations. To address this gap, we created AniMINT, a novel dataset of 300 densely annotated UI animation videos. We systematically evaluate state-of-the-art VLMs on UI animation understanding, including their abilities to perceive the animation effects, identify animation purposes, and interpret animation meaning. Our results show that VLMs can reliably detect primitive motion. However, their high-level animation interpretation remains inconsistent, with substantial gaps relative to human performance. Finally, we use Motion, Context, and Perceptual Cues (MCPC) to probe factors affecting VLM performance, revealing key bottlenecks and directions for future improvement.
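For concreteness, here is a minimal, hypothetical sketch of the kind of evaluation the abstract describes: sampling frames from a UI animation clip and asking a VLM a question at one of the three levels (effect, purpose, meaning). This is not the authors' code or the AniMINT pipeline; the model name, prompt, frame count, and file name are illustrative assumptions.

```python
# Hypothetical sketch only: not the authors' code or the AniMINT pipeline.
import base64

import cv2  # pip install opencv-python
from openai import OpenAI  # pip install openai


def sample_frames(video_path: str, n: int = 8) -> list[str]:
    """Uniformly sample n frames from a clip as base64-encoded JPEGs."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // n)
        ok, frame = cap.read()
        if not ok:
            break
        ok, buf = cv2.imencode(".jpg", frame)
        if ok:
            frames.append(base64.b64encode(buf.tobytes()).decode())
    cap.release()
    return frames


client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A "meaning"-level question; the paper also probes effects and purposes.
question = "What does this UI animation communicate to the user?"
frames = sample_frames("ui_animation.mp4")  # placeholder file name

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder VLM; the paper benchmarks several models
    messages=[{
        "role": "user",
        "content": [{"type": "text", "text": question}]
        + [
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{f}"}}
            for f in frames
        ],
    }],
)
print(response.choices[0].message.content)
```

A model's free-form answer to prompts like this would then be scored against the dataset's human annotations, which is where the paper reports the gap between low-level motion detection and high-level interpretation.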

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.