ArXiv TLDR

GazeVLA: Learning Human Intention for Robotic Manipulation

2604.22615

Chengyang Li, Kaiyi Xiong, Yuan Xu, Lei Qian, Yizhou Wang + 1 more

cs.RO

TLDR

GazeVLA uses human gaze as an intention proxy to bridge the human-robot embodiment gap, improving robotic manipulation with less robot data.

Key contributions

  • Introduces GazeVLA, a framework that learns human intention from gaze to bridge the human-robot embodiment gap.
  • Pretrains on large egocentric human datasets to capture gaze-action synergy, then finetunes on minimal robot data.
  • Employs a Chain-of-Thought reasoning paradigm, predicting intention before executing robotic actions.
  • Achieves state-of-the-art performance with stronger generalization and robustness across diverse manipulation tasks.
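The Chain-of-Thought paradigm above can be illustrated with a minimal sketch: a two-stage policy that first predicts an intention (a gaze point) from the observation, then conditions the action on that intention. All class and method names here (`GazeVLAPolicy`, `predict_intention`, `predict_action`) are hypothetical stand-ins, not the paper's actual API, and the placeholder predictions are purely illustrative.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Intention:
    gaze_xy: Tuple[float, float]  # predicted 2D gaze point in the image


class GazeVLAPolicy:
    """Toy two-stage policy: intention first, action second (hypothetical)."""

    def predict_intention(self, image) -> Intention:
        # Placeholder: a real model would regress gaze from the observation.
        h, w = len(image), len(image[0])
        return Intention(gaze_xy=(w / 2, h / 2))

    def predict_action(self, image, intention: Intention) -> dict:
        # Placeholder: the action is conditioned on the predicted intention.
        return {"reach_to": intention.gaze_xy, "gripper": "open"}

    def act(self, image) -> dict:
        # Chain-of-Thought ordering: intention is predicted before the action.
        intention = self.predict_intention(image)
        return self.predict_action(image, intention)


# Usage with a stand-in 48x64 observation
image = [[0] * 64 for _ in range(48)]
policy = GazeVLAPolicy()
action = policy.act(image)  # intention computed first, then the action
```

The point of the sketch is only the ordering: the intermediate intention is made explicit and feeds the action head, rather than the policy mapping observations directly to actions.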

Why it matters

This paper addresses a key challenge in robotics: reducing reliance on extensive robot demonstrations by leveraging human data. By explicitly modeling human intention through gaze, GazeVLA offers a novel approach to transfer human knowledge more effectively. This could significantly accelerate the development of more capable and data-efficient robotic systems.

Original Abstract

Embodied foundation models have achieved significant breakthroughs in robotic manipulation, yet they still depend heavily on large-scale robot demonstrations. Although recent works have explored leveraging human data to alleviate this dependency, effectively extracting transferable knowledge remains a significant challenge due to the inherent embodiment gap between human and robot. We argue that the intention underlying human actions can serve as a powerful intermediate representation for bridging this gap. In this paper, we introduce a novel framework that explicitly learns and transfers human intention to facilitate robotic manipulation. Specifically, we model intention through gaze, as it naturally precedes physical actions and serves as an observable proxy for human intent. Our model is first pretrained on a large-scale egocentric human dataset to capture human intention and its synergy with action, followed by finetuning on a small set of robot and human data. During inference, the model adopts a Chain-of-Thought reasoning paradigm, sequentially predicting intention before executing the action. Extensive evaluations in simulation and real-world settings, across long-horizon and fine-grained tasks, and under few-shot and robustness benchmarks, show that our method consistently outperforms strong baselines, generalizes better, and achieves state-of-the-art performance.
