IntentVLM: Open-Vocabulary Intention Recognition through Forward-Inverse Modeling with Video-Language Models
Hamed Rahimi, Clemence Grislain, Adrien Jacquet Cretides, Olivier Sigaud, Mohamed Chetouani
TLDR
IntentVLM is a novel video-language framework that uses forward-inverse modeling to achieve state-of-the-art open-vocabulary human intention recognition.
Key contributions
- Introduces IntentVLM, a two-stage video-language framework for open-vocabulary intention recognition.
- Decomposes intention understanding into goal candidate generation and structured inference, inspired by cognitive science.
- Achieves state-of-the-art accuracy (up to 80%) on IntentQA and Inst-IT Bench, outperforming baselines by 30%.
- Enhances open-vocabulary understanding without catastrophic forgetting, providing a robust foundation for robotics.
Why it matters
This paper advances human-robot interaction by enabling robots to infer human intentions in complex, open-vocabulary settings. Its two-stage forward-inverse modeling approach reduces hallucinations in the model's latent reasoning, and its reported state-of-the-art results, which match human performance, pave the way for more intuitive and effective human-centered robotics.
Original Abstract
Improving the effectiveness of human-robot interaction requires social robots to accurately infer human goals through robust intention understanding. This challenge is particularly critical in multimodal settings, where agents must integrate heterogeneous signals, including text and visual cues, to form a coherent interpretation of user intent. This paper presents IntentVLM, a novel two-stage video-language framework designed for open-vocabulary human intention recognition. Inspired by forward-inverse modeling in cognitive science, the approach decomposes intention understanding into goal candidate generation followed by structured inference through selection, effectively reducing hallucinations in latent reasoning. Evaluated on the IntentQA and Inst-IT Bench datasets, IntentVLM achieves state-of-the-art results with up to 80% accuracy, notably surpassing baseline performance by 30% and matching human performance. Our findings demonstrate that this structured reasoning approach enhances open-vocabulary intention understanding without catastrophic forgetting, offering a robust foundation for human-centered robotics.
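To make the two-stage idea concrete, here is a minimal sketch of a generate-then-select pipeline of the kind the abstract describes. This is not the paper's implementation: the `vlm` callable, the prompts, and all function names are illustrative assumptions, standing in for whatever video-language model interface IntentVLM actually uses.

```python
from typing import Callable, List

def recognize_intention(
    video_frames: list,
    question: str,
    vlm: Callable[[list, str], str],  # assumed interface: (frames, prompt) -> text
    num_candidates: int = 5,
) -> str:
    """Hypothetical two-stage open-vocabulary intention recognition.

    Stage 1 (generation): the VLM proposes candidate goals for the video.
    Stage 2 (selection): the VLM picks the candidate that best explains the
    observed behavior, constraining reasoning to a closed candidate set.
    """
    # Stage 1: open-ended goal candidate generation.
    gen_prompt = (
        f"List {num_candidates} plausible goals the person in this video "
        f"could be pursuing, one per line."
    )
    candidates: List[str] = [
        line.strip("- ").strip()
        for line in vlm(video_frames, gen_prompt).splitlines()
        if line.strip()
    ][:num_candidates]

    # Stage 2: structured inference by selection over the candidates.
    options = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    sel_prompt = (
        f"Question: {question}\nCandidate goals:\n{options}\n"
        f"Answer with only the number of the goal that best explains "
        f"the observed actions."
    )
    reply = vlm(video_frames, sel_prompt).strip()
    digits = "".join(ch for ch in reply if ch.isdigit())
    index = int(digits) - 1 if digits else 0
    return candidates[index] if 0 <= index < len(candidates) else candidates[0]
```

The design intuition is that free-form intention inference lets the model hallucinate unconstrained explanations, while answering a multiple-choice question over its own proposals forces the final inference to stay grounded in the observed video.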