Assessing VLM-Driven Semantic-Affordance Inference for Non-Humanoid Robot Morphologies
Jess Jones, Raul Santos-Rodriguez, Sabine Hauert
TLDR
This paper assesses VLM-driven semantic affordance inference for non-humanoid robots, finding that VLMs generalize across morphologies but make conservative and inconsistent affordance predictions.
Key contributions
- Investigates VLM affordance inference for non-humanoid robots, addressing a critical research gap.
- Introduces a novel hybrid dataset combining real-world and VLM-generated synthetic affordance data.
- Finds VLMs generalize to non-humanoid forms but show inconsistent performance across object domains.
- Identifies consistently low false-positive but high false-negative rates, indicating that VLMs make conservative affordance predictions.
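The low-false-positive / high-false-negative pattern can be made concrete with a short illustration. The sketch below uses hypothetical binary affordance labels (1 = "the robot can act on the object", 0 = "it cannot"), not data from the paper, to show how these two rates characterise a conservative predictor.

```python
def fpr_fnr(y_true, y_pred):
    """Return (false positive rate, false negative rate) for binary labels."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    return fp / negatives, fn / positives

# Hypothetical example: a conservative model rarely asserts an affordance
# it is unsure of, so it misses many true affordances (high FNR) while
# making few spurious claims (low FPR).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0, 0, 1]
fpr, fnr = fpr_fnr(y_true, y_pred)
print(f"FPR = {fpr:.2f}, FNR = {fnr:.2f}")  # FPR = 0.00, FNR = 0.50
```

For a robot, the asymmetry matters: a false positive risks an unsafe or failed action, while a false negative merely forgoes a feasible one, which is why the paper frames low FPR as a safety benefit worth preserving.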
Why it matters
VLMs are crucial for robot autonomy, but their applicability to diverse robot morphologies was unclear. This work fills that gap by showing VLMs can generalize but tend to be overly conservative. Understanding these limitations is vital for safely deploying VLMs in real-world robotic systems, especially for novel tasks.
Original Abstract
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding human-object interactions, but their application to robotic systems with non-humanoid morphologies remains largely unexplored. This work investigates whether VLMs can effectively infer affordances for robots with fundamentally different embodiments than humans, addressing a critical gap in the deployment of these models for diverse robotic applications. We introduce a novel hybrid dataset that combines annotated real-world robotic affordance-object relations with VLM-generated synthetic scenarios, and perform an empirical analysis of VLM performance across multiple object categories and robot morphologies, revealing significant variations in affordance inference capabilities. Our experiments demonstrate that while VLMs show promising generalisation to non-humanoid robot forms, their performance is notably inconsistent across different object domains. Critically, we identify a consistent pattern of low false positive rates but high false negative rates across all morphologies and object categories, indicating that VLMs tend toward conservative affordance predictions. Our analysis reveals that this pattern is particularly pronounced for novel tool use scenarios and unconventional object manipulations, suggesting that effective integration of VLMs in robotic systems requires complementary approaches to mitigate over-conservative behaviour while preserving the inherent safety benefits of low false positive rates.