ArXiv TLDR

Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models

2605.13632

Yiran Ling, Qing Lian, Jinghang Li, Qing Jiang, Tianming Zhang + 4 more

cs.RO cs.CV

TLDR

GTA-VLA is an interactive Vision-Language-Action framework that uses user-provided spatial guidance to improve robot reasoning and robustness in embodied tasks.

Key contributions

  • Introduces GTA-VLA, an interactive VLA framework enabling user spatial guidance for robot policies (the guidance cues are sketched after this list).
  • Generates a unified spatial-visual Chain-of-Thought, integrating human guidance with task planning.
  • Addresses the brittleness of existing Sense-to-Act VLA models under out-of-domain shifts.
  • Achieves SOTA (81.2%) on SimplerEnv WidowX and significantly improves OOD task success.
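
The paper names three kinds of user cues (affordance points, boxes, and traces) but the summary does not give an interface for them; below is a minimal Python sketch of how such cues might be represented. The class names and fields are illustrative assumptions, not the authors' API.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class PointGuidance:
    """A single affordance point clicked by the user (image pixel coordinates)."""
    x: float
    y: float

@dataclass
class BoxGuidance:
    """An axis-aligned box drawn around the target object or region."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

@dataclass
class TraceGuidance:
    """A rough 2D path sketched over the image, e.g. a desired motion trace."""
    waypoints: List[Tuple[float, float]]

# Any of the three cue types can be supplied to the policy, or omitted entirely.
SpatialGuidance = Union[PointGuidance, BoxGuidance, TraceGuidance]
```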

Why it matters

Current VLA models are brittle under out-of-domain conditions; this paper tackles that brittleness by letting humans provide spatial guidance. The interactive approach significantly improves failure recovery and aligns human visual intent with autonomous decision-making, making robots more robust and controllable.

Original Abstract

In this paper, we propose GTA-VLA (Guide, Think, Act), an interactive Vision-Language-Action (VLA) framework that enables spatially steerable embodied reasoning by allowing users to guide robot policies with explicit visual cues. Existing VLA models learn a direct "Sense-to-Act" mapping from multimodal observations to robot actions. While effective within the training distribution, such tightly coupled policies are brittle under out-of-domain (OOD) shifts and difficult to correct when failures occur. Although recent embodied Chain-of-Thought (CoT) approaches expose intermediate reasoning, they still lack a mechanism for incorporating human spatial guidance, limiting their ability to resolve visual ambiguities or recover from mistakes. To address this gap, our framework allows users to optionally guide the policy with spatial priors, such as affordance points, boxes, and traces, which the subsequent reasoning process can directly condition on. Based on these inputs, the model generates a unified spatial-visual Chain-of-Thought that integrates external guidance with internal task planning, aligning human visual intent with autonomous decision-making. For practical deployment, we further couple the reasoning module with a lightweight reactive action head for efficient action execution. Extensive experiments demonstrate the effectiveness of our approach. On the in-domain SimplerEnv WidowX benchmark, our framework achieves a state-of-the-art 81.2% success rate. Under OOD visual shifts and spatial ambiguities, a single visual interaction substantially improves task success over existing methods, highlighting the value of interactive reasoning for failure recovery in embodied control. Details of the project can be found here: https://signalispupupu.github.io/GTA-VLA_ProjPage/
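
As a reading aid, here is a minimal sketch of the data flow the abstract describes: an optional user cue is folded into the reasoning context, the model produces a spatial-visual chain-of-thought, and a lightweight reactive head decodes the action. All function names, the token encoding, and the 7-DoF action format are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def encode_guidance(guidance):
    """Serialize an optional user cue (point / box / trace) into extra prompt tokens."""
    return [] if guidance is None else [f"<guidance:{guidance}>"]

def reasoning_vlm(image, instruction, extra_tokens):
    """Stand-in for the reasoning module: emits a spatial-visual chain-of-thought."""
    return f"plan for '{instruction}' conditioned on {extra_tokens}"

def action_head(chain_of_thought, image):
    """Stand-in for the lightweight reactive action head: emits a low-level action."""
    return np.zeros(7)  # e.g. a 7-DoF end-effector command (an assumption)

def gta_vla_step(image, instruction, guidance=None):
    """One control step in the Guide -> Think -> Act loop; guidance is optional."""
    tokens = encode_guidance(guidance)               # Guide: user spatial prior
    cot = reasoning_vlm(image, instruction, tokens)  # Think: unified spatial-visual CoT
    return action_head(cot, image)                   # Act: reactive action decoding
```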
