Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots

arXiv:2604.17817

Shiquan Zhang, Tianyi Zhang, Le Fang, Simon D'Alfonso, Hong Jia + 1 more

cs.HC, cs.AI, cs.MA

TLDR

This paper introduces DailyDroid, a benchmark for LLM-driven smartphone automation, comparing text-only vs. multimodal inputs and analyzing common failure modes.

Key contributions

  • Created DailyDroid, a benchmark of 75 tasks across 25 Android apps for LLM smartphone automation.
  • Compared text-only vs. multimodal (text + screenshot) inputs on GPT-4o and o4-mini across 300 trials, finding comparable performance with marginally higher success rates for multimodal inputs.
  • Revealed critical failure points in UI accessibility, input modalities, and LLM/app design.

Why it matters

Mobile agents often fail on everyday tasks, and little prior work has examined why and where. This paper contributes both a benchmark and a detailed failure analysis for LLM-driven smartphone automation. Its findings point developers to concrete problems in mobile agents, application design, and UI accessibility that currently limit reliable automation.

Original Abstract

With the rapid advancement of large language models (LLMs), mobile agents have emerged as promising tools for phone automation, simulating human interactions on screens to accomplish complex tasks. However, these agents often suffer from low accuracy, misinterpretation of user instructions, and failure on challenging tasks, with limited prior work examining why and where they fail. To address this, we introduce DailyDroid, a benchmark of 75 tasks in five scenarios across 25 Android apps, spanning three difficulty levels to mimic everyday smartphone use. We evaluate it using text-only and multimodal (text + screenshot) inputs on GPT-4o and o4-mini across 300 trials, revealing comparable performance with multimodal inputs yielding marginally higher success rates. Through in-depth failure analysis, we compile a handbook of common failures. Our findings reveal critical issues in UI accessibility, input modalities, and LLM/app design, offering implications for future mobile agents, applications, and UI development.
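
The abstract does not say how the 300 trials are distributed. One reading consistent with its numbers is a full grid in which each of the 75 tasks is run once per model and per input mode (75 × 2 × 2 = 300). The sketch below enumerates that grid in Python; the one-trial-per-combination design and the placeholder task IDs are assumptions, not details taken from the paper.

```python
from itertools import product

# Counts and names stated in the abstract.
NUM_TASKS = 75
MODELS = ["GPT-4o", "o4-mini"]
INPUT_MODES = ["text-only", "text + screenshot"]

# Assumption: exactly one trial per (task, model, input mode) combination.
# The task IDs are placeholders, not the benchmark's real task names.
task_ids = range(1, NUM_TASKS + 1)
trials = list(product(task_ids, MODELS, INPUT_MODES))

# Under this assumption the grid size matches the reported trial count.
print(len(trials))
```

If the paper instead repeats some tasks or skips some combinations, the per-condition counts would differ even though the total stays at 300.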
