ArXiv TLDR

StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning

arXiv:2605.03927

Xiaowen Sun, Matthias Kerzel, Mengdi Li, Xufeng Zhao, Paul Striker + 1 more

cs.CV

TLDR

StateVLM introduces a novel training strategy with Auxiliary Regression Loss to enhance vision-language models for precise object and state localization in robotics.

Key contributions

  • Introduces Auxiliary Regression Loss (ARL) to adapt VLMs for precise object detection and state localization.
  • Develops StateVLM, a model leveraging ARL for fine-grained object and state perception, including graspable regions.
  • Creates OSAR, a new open-source benchmark for object-state affordance reasoning with 1,172 scenes.
  • ARL improves VLM performance by 1.6% on RefCOCO benchmarks and 5.2% on the OSAR benchmark.

Why it matters

VLMs often struggle with numerical reasoning, which limits their use in robotics. This paper addresses that gap by enhancing their ability to precisely locate objects and their states, a capability that is crucial for robust robotic manipulation and interaction with the physical world.

Original Abstract

Vision-language models (VLMs) have shown remarkable performance in various robotic tasks, as they can perceive visual information and understand natural language instructions. However, when applied to robotics, VLMs remain subject to a fundamental limitation inherent in large language models (LLMs): they struggle with numerical reasoning, particularly in object detection and object-state localization. To explore numerical reasoning as a regression task in VLMs, we propose a novel training strategy to adapt VLMs for object detection and object-state localization. This approach leverages box decoder outputs to compute an Auxiliary Regression Loss (ARL) during fine-tuning, while preserving standard sequence prediction at inference. We leverage this training strategy to develop StateVLM (State-aware Vision-Language Model), a novel model designed to perceive and learn fine-grained object representations, including precise localization of objects and their states, as well as graspable regions. Due to the lack of a benchmark for object-state affordance reasoning, we introduce an open-source benchmark, Object State Affordance Reasoning (OSAR), which contains 1,172 scenes with 7,746 individual objects and corresponding bounding boxes. Comparative experiments on adapted benchmarks (RefCOCO, RefCOCO+, and RefCOCOg) demonstrate that ARL improves model performance by an average of 1.6% compared to models without ARL. Experiments on the OSAR benchmark further support this finding, showing that StateVLM with ARL achieves an average of 5.2% higher performance than models without ARL. In particular, ARL is also important for the complex task of affordance reasoning in OSAR, where it enhances the consistency of model outputs.
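The abstract describes ARL as a regression loss computed from box-decoder outputs during fine-tuning, with inference left as plain sequence prediction. A minimal PyTorch sketch of such a combined objective is shown below; the class name, the smooth-L1 regression choice, and the weighting hyperparameter `lambda_arl` are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class ARLObjective(nn.Module):
    """Sketch of a combined fine-tuning objective: next-token
    cross-entropy plus an auxiliary regression term on box-decoder
    outputs. Hypothetical implementation, not the authors' code."""

    def __init__(self, lambda_arl: float = 1.0):
        super().__init__()
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)  # -100 = masked tokens
        self.reg = nn.SmoothL1Loss()                      # box regression term
        self.lambda_arl = lambda_arl                      # assumed weighting

    def forward(self, logits, target_ids, pred_boxes, gt_boxes):
        # logits: (B, T, V) LM logits; target_ids: (B, T) token labels
        # pred_boxes / gt_boxes: (B, N, 4) normalized box coordinates
        lm_loss = self.ce(logits.flatten(0, 1), target_ids.flatten())
        arl = self.reg(pred_boxes, gt_boxes)
        # ARL applies only during fine-tuning; at inference the model
        # keeps standard sequence prediction, so only lm_loss matters.
        return lm_loss + self.lambda_arl * arl

# Toy example with random tensors
B, T, V, N = 2, 8, 100, 3
loss_fn = ARLObjective(lambda_arl=0.5)
loss = loss_fn(torch.randn(B, T, V),
               torch.randint(0, V, (B, T)),
               torch.rand(B, N, 4),
               torch.rand(B, N, 4))
print(loss.item())
```

Because the regression term only shapes the fine-tuning gradient, the model's inference-time interface is unchanged, which matches the abstract's claim that standard sequence prediction is preserved.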
