ArXiv TLDR

Ego-Grounding for Personalized Question-Answering in Egocentric Videos

2604.01966

Junbin Xiao, Shenglang Zhang, Pengxiang Zhu, Angela Yao

cs.CV cs.AI cs.RO

TLDR

This paper introduces MyEgo, a new egocentric video QA dataset, revealing that current MLLMs struggle with personalized ego-grounding and long-term memory.

Key contributions

  • Introduces MyEgo, the first egocentric VideoQA dataset for personalized questions about the camera-wearer.
  • Benchmarks show that leading MLLMs (GPT-5, Qwen3-VL) reach only 36-46% accuracy, trailing human performance by roughly 40-50 points.
  • Findings highlight MLLMs' limitations in ego-grounding, long-range memory, and tracking "my past" in videos.
  • Neither explicit reasoning nor model scaling consistently improves performance on personalized QA tasks.

Why it matters

Personalized assistance in egocentric videos requires understanding the camera-wearer's perspective and history. This work exposes critical weaknesses in current MLLMs regarding ego-grounding and long-term memory, providing a crucial benchmark and direction for future research.

Original Abstract

We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding, the ability to understand the camera-wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs' ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about "my things", "my activities", and "my past". Benchmarking reveals that competitive MLLMs across variants, including open-source vs. proprietary, thinking vs. non-thinking, and small vs. large scales, all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only ~46% and 36% accuracy, trailing human performance by nearly 40% and 50%, respectively. Surprisingly, neither explicit reasoning nor model scaling yields consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering "me" and "my past". These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo
