ArXiv TLDR

IntenBot: Flexible and Imprecise Multimodal Input for LLMs to Understand User Intentions for Casual and Human-Like HRI

arXiv:2605.04585

Yen-Ting Liu, Chiu-Hsuan Wang, TzuLing Chen, Ting-Ying Lee, Tzu-Hua Wang + 3 more

cs.HC

TLDR

IntenBot uses LLMs to interpret flexible, imprecise multimodal input (voice, gaze, pointing) for more human-like robot interaction in XR.

Key contributions

  • Interprets user intent from flexible, imprecise multimodal input (voice, gaze, pointing) in XR.
  • Leverages LLMs to disambiguate and filter noisy multimodal data for instruction generation (a minimal sketch follows this list).
  • Enables casual, human-like HRI, reducing user effort and attention.
  • Validated via user behavior studies, XR evaluation, and physical robot deployment.
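
The LLM disambiguation step could look roughly like the Python sketch below. It assumes a hypothetical `MultimodalInput` record and treats the LLM as any plain `str -> str` callable (e.g., a chat-completion wrapper); none of these names or the prompt wording come from the paper.

```python
from dataclasses import dataclass

@dataclass
class MultimodalInput:
    transcript: str              # possibly implicit voice command, e.g. "I want that."
    gaze_candidates: list[str]   # object labels within the gaze angle range
    point_candidates: list[str]  # object labels within the pointing angle range

def build_prompt(obs: MultimodalInput) -> str:
    """Pack noisy multimodal cues into one prompt so the LLM can
    filter out irrelevant modalities and propose concrete instructions."""
    return (
        "A user addressed a robot. Resolve the intended instruction.\n"
        f"Voice: \"{obs.transcript}\"\n"
        f"Objects near gaze ray: {obs.gaze_candidates or 'none'}\n"
        f"Objects near pointing ray: {obs.point_candidates or 'none'}\n"
        "Ignore modalities that conflict with the voice command. "
        "List up to 3 candidate instructions, one per line, "
        "formatted as ACTION(object), for the user to confirm."
    )

def disambiguate(obs: MultimodalInput, llm) -> list[str]:
    """`llm` is any callable str -> str wrapping a chat-style model."""
    reply = llm(build_prompt(obs))
    return [line.strip() for line in reply.splitlines() if line.strip()]
```

In practice, the returned candidate instructions would be surfaced in XR for the user to confirm, as the abstract describes.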

Why it matters

This paper addresses the challenge of making interaction with robots as casual and human-like as human-to-human communication. By using LLMs to handle flexible and imprecise multimodal input, IntenBot makes HRI more natural and efficient, paving the way for intuitive robot interfaces.

Original Abstract

In natural human-to-human communication, multimodal user input is typically used to supplement explicit and complement implicit voice commands, with casualness allowing for flexible input modality combinations and tolerance for imprecise input data. For example, saying "I want that." with a casual glance at a bottle of water is clear enough in human-to-human communication as an implicit voice command accompanied by gaze and/or gestures, rather than an explicit one. To enable such a human-like interaction in human-robot interaction (HRI), we propose a system, IntenBot, to understand user intentions from flexible and imprecise multimodal input, including voice, gaze, and finger-pointing, in XR. The disambiguation capability of large language models (LLMs) is used to filter out irrelevant input modalities and imprecise input data, generating potential instructions for user confirmation. The flexible and imprecise multimodal input enables casual, human-like interaction with robots, reducing time, effort, and attention, and could also be used as non-voice input. We conducted an informative user behavior study in a simulated environment to understand users' natural behavior in flexibly interacting with a robot using multimodal input and to obtain appropriate angle range parameters for gaze and finger-pointing. An XR study was then performed to evaluate the performance of IntenBot, compared with other methods. We also deployed IntenBot on a physical robot to showcase its real-world applications.
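
The "angle range parameters" for gaze and finger-pointing suggest a cone test around each ray. Below is a minimal sketch of that idea, assuming objects are labeled 3D points; the 15° threshold and the example scene are illustrative placeholders, not values measured in the paper's behavior study.

```python
import math

Vec3 = tuple[float, float, float]

def angle_deg(u: Vec3, v: Vec3) -> float:
    """Angle between two 3D vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 180.0  # degenerate vector: treat as "not aligned"
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (nu * nv)))))

def candidates_in_cone(origin: Vec3, direction: Vec3,
                       objects: dict[str, Vec3],
                       max_angle: float) -> list[str]:
    """Labels of objects whose direction from `origin` lies within
    `max_angle` degrees of the gaze/pointing ray, tolerating imprecision."""
    hits = []
    for label, pos in objects.items():
        to_obj = (pos[0] - origin[0], pos[1] - origin[1], pos[2] - origin[2])
        if angle_deg(direction, to_obj) <= max_angle:
            hits.append(label)
    return hits

# Hypothetical scene: head at eye height, gazing down the +z axis.
scene = {"water_bottle": (0.1, 1.2, 2.0), "mug": (1.5, 0.9, 0.5)}
print(candidates_in_cone((0.0, 1.6, 0.0), (0.0, 0.0, 1.0), scene, 15.0))
# -> ['water_bottle']
```

The labels that survive this filter would then feed the LLM disambiguation step sketched earlier.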
