Gaze4HRI: Zero-shot Benchmarking Gaze Estimation Neural-Networks for Human-Robot Interaction
Berk Sezer, Ali Görkem Küçük, Erol Şahin, Sinan Kalkan
TLDR
Gaze4HRI introduces a large-scale benchmark for zero-shot gaze estimation in HRI, revealing current methods' failures and highlighting data diversity as key to robustness.
Key contributions
- Introduces Gaze4HRI, a large-scale dataset (50+ subjects, 3,000+ videos) for HRI gaze estimation.
- Benchmarks state-of-the-art gaze methods against HRI variables such as illumination, head-gaze conflict, and motion (the standard angular-error metric is sketched after this list).
- Reveals that all evaluated methods fail in at least one condition, with steeply downward gaze as a universal failure point.
- Finds that extensive data diversity (e.g., ETH-XGaze) and resilience-enhancing frameworks like PureGaze drive zero-shot robustness.
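
Benchmarks of this kind typically report the 3D angular error between predicted and ground-truth gaze vectors. The summary does not spell the metric out, so the following is a minimal illustrative sketch of that standard metric, assuming unit-normalizable 3D gaze vectors; the function name and toy data are placeholders, not the paper's code.

```python
import numpy as np

def angular_error_deg(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Per-frame 3D angular error (degrees) between predicted and
    ground-truth gaze vectors, each of shape (N, 3)."""
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=1, keepdims=True)
    cos_sim = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos_sim))

# Toy usage: two frames, predictions vs. ground truth
pred = np.array([[0.00, 0.00, -1.00], [0.10, -0.20, -0.97]])
gt   = np.array([[0.00, 0.10, -1.00], [0.00,  0.00, -1.00]])
print(angular_error_deg(pred, gt))  # per-frame error in degrees
```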
Why it matters
This paper provides a crucial benchmark for zero-shot gaze estimation in HRI, addressing real-world conditions that existing evaluations overlook. Its findings offer practical guidelines for practitioners and redirect future research toward data diversity and robust training frameworks rather than increasingly complex models, improving the reliability of gaze-based HRI.
Original Abstract
While zero-shot appearance-based 3D gaze estimation offers significant cost-efficiency by directly mapping RGB images to gaze vectors, its reliability in Human-Robot Interaction (HRI) settings remains uncertain. Existing benchmarks frequently overlook fundamental HRI conditions, such as dynamic camera viewpoints and moving targets in video. Furthermore, current cross-dataset evaluations often suffer from a complexity gap, where methods trained on diverse datasets are tested on significantly smaller and less varied sets, failing to assess true robustness. To bridge these gaps, we introduce Gaze4HRI, a large-scale dataset (50+ subjects, 3,000+ videos, 600,000+ frames) designed to evaluate state-of-the-art performance against critical HRI variables: illumination, head-gaze conflict, as well as the motion of camera and gaze target in video. Our benchmark reveals that all evaluated methods fail in at least one condition, identifying steeply downward gaze as a universal failure point. Notably, PureGaze trained on the ETH-XGaze dataset uniquely maintains resilience across all other conditions. These results challenge the recent focus in the literature on complex spatial-temporal modeling and Transformer-based architectures. Instead, our findings suggest that extensive data diversity, as exemplified by the ETH-XGaze dataset, serves as the primary driver of zero-shot robustness in unconstrained environments, while resilience-enhancing frameworks, such as PureGaze's self-adversarial loss for gaze feature purification, provide a substantial further improvement. Ultimately, this study establishes a rigorous benchmark that provides practical guidelines for practitioners and reshapes future research. The dataset and code are available at https://gazeforhri.github.io.
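
For intuition on "gaze feature purification", below is a heavily simplified PyTorch sketch in the spirit of PureGaze's self-adversarial objective: the backbone is trained to predict gaze well while (via gradient reversal) resisting image reconstruction, so gaze-irrelevant appearance cues are purged from its features. The modules, sizes, and losses here are hypothetical stand-ins, not the paper's or PureGaze's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates gradients in the backward
    pass, making the backbone adversarial to the reconstruction head."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        return -grad

# Illustrative stand-ins (not the PureGaze architecture).
backbone   = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                           nn.AdaptiveAvgPool2d(4), nn.Flatten())   # -> (N, 256)
gaze_head  = nn.Linear(256, 3)             # regresses a 3D gaze vector
recon_head = nn.Linear(256, 3 * 32 * 32)   # tries to rebuild the input image

img, gt_gaze = torch.randn(8, 3, 32, 32), torch.randn(8, 3)

feat = backbone(img)
gaze_loss = F.mse_loss(gaze_head(feat), gt_gaze)
# The reconstruction head minimizes this loss, but reversed gradients push
# the backbone to *maximize* it, purging gaze-irrelevant appearance info.
recon = recon_head(GradReverse.apply(feat)).view_as(img)
recon_loss = F.mse_loss(recon, img)
(gaze_loss + recon_loss).backward()        # gradients for one toy training step
```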