ArXiv TLDR

Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision

2604.21461

Chentao Li, Zirui Gao, Mingze Gao, Yinglian Ren, Jianjiang Feng + 1 more

cs.CV cs.HC

TLDR

This paper introduces EgoPoint-Bench to benchmark and enhance MLLMs' understanding of pointing gestures in egocentric vision, addressing "Referential Hallucination."

Key contributions

  • Identifies "Referential Hallucination" where MLLMs fail to precisely ground egocentric pointing.
  • Introduces EgoPoint-Bench, an 11k-sample question-answering benchmark for egocentric multimodal pointing reasoning.
  • EgoPoint-Bench spans five evaluation dimensions and three levels of referential complexity across both simulated and real-world data.
  • Shows that fine-tuning on synthetic data significantly improves MLLMs' pointing comprehension and yields robust sim-to-real generalization.

Why it matters

This work addresses a critical gap in egocentric AI, where MLLMs fail to precisely ground pointing gestures. It offers a scalable path toward more accurate, spatially aware egocentric AI assistants such as smart glasses.

Original Abstract

Egocentric AI agents, such as smart glasses, rely on pointing gestures to resolve referential ambiguities in natural language commands. However, despite advancements in Multimodal Large Language Models (MLLMs), current systems often fail to precisely ground the spatial semantics of pointing. Instead, they rely on spurious correlations with visual proximity or object saliency, a phenomenon we term "Referential Hallucination." To address this gap, we introduce EgoPoint-Bench, a comprehensive question-answering benchmark designed to evaluate and enhance multimodal pointing reasoning in egocentric views. Comprising over 11k high-fidelity simulated and real-world samples, the benchmark spans five evaluation dimensions and three levels of referential complexity. Extensive experiments demonstrate that while state-of-the-art proprietary and open-source models struggle with egocentric pointing, models fine-tuned on our synthetic data achieve significant performance gains and robust sim-to-real generalization. This work highlights the importance of spatially aware supervision and offers a scalable path toward precise egocentric AI assistants. Project page: https://guyyyug.github.io/EgoPoint-Bench/
