ArXiv TLDR

GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

arXiv:2605.12369

Xiaosong Jia, Bowen Yang, Zuhao Ge, Xian Nie, Yuchen Zhou + 15 more

cs.RO

TLDR

GuidedVLA improves the generalization and robustness of VLA models by supervising individual attention heads in the action decoder with auxiliary signals, so that each head specializes in a distinct task-relevant factor.

Key contributions

  • Mitigates VLA overfitting to spurious correlations by explicitly guiding the action decoder to focus on task-relevant factors.
  • Introduces specialized attention heads, each supervised by an auxiliary signal to capture a distinct factor (see the sketch after this list).
  • Instantiates the paradigm with three heads: object grounding, spatial geometry, and temporal skill logic.
  • Improves success rates in-domain and out-of-domain across simulation and real-robot experiments.
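
To make the head-specialization idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: a cross-attention layer in the action decoder exposes per-head attention maps and contexts so that three designated heads can be trained with auxiliary losses. Everything here is an assumption for illustration — the class and function names, which head index carries which factor, and the specific loss forms (a KL term pulling the grounding head's attention onto the referred object's tokens, a position-regression probe, a skill classifier).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedCrossAttention(nn.Module):
    """Cross-attention from action queries to VLM tokens that exposes
    per-head attention maps and per-head contexts, so that designated
    heads can receive auxiliary supervision."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.dk = n_heads, d_model // n_heads
        self.wq, self.wk = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.wv, self.wo = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)

    def forward(self, queries, tokens):
        B = queries.shape[0]
        split = lambda x: x.view(B, -1, self.h, self.dk).transpose(1, 2)
        q, k, v = split(self.wq(queries)), split(self.wk(tokens)), split(self.wv(tokens))
        attn = torch.softmax(q @ k.transpose(-1, -2) / self.dk ** 0.5, dim=-1)  # (B,H,Q,K)
        ctx = attn @ v                                                          # (B,H,Q,dk)
        out = self.wo(ctx.transpose(1, 2).reshape(B, -1, self.h * self.dk))     # (B,Q,D)
        return out, attn, ctx

def guided_losses(attn, ctx, spatial_probe, skill_probe,
                  object_mask, target_xyz, skill_id):
    """Auxiliary terms for three designated heads. The targets are assumed
    to come with the data: object_mask (B,K) flags visual tokens on the
    referred object, target_xyz (B,3) is its position, skill_id (B,) is
    the current skill phase."""
    # Head 0, object grounding: pull its query-averaged attention
    # distribution toward a uniform distribution over the object's tokens.
    tgt = object_mask / object_mask.sum(-1, keepdim=True).clamp(min=1.0)
    l_ground = F.kl_div(attn[:, 0].mean(1).clamp_min(1e-8).log(), tgt,
                        reduction="batchmean")
    # Head 1, spatial geometry: linearly probe its context for 3-D position.
    l_spatial = F.mse_loss(spatial_probe(ctx[:, 1].mean(1)), target_xyz)
    # Head 2, temporal skill logic: probe its context for the current skill.
    l_skill = F.cross_entropy(skill_probe(ctx[:, 2].mean(1)), skill_id)
    return l_ground, l_spatial, l_skill
```

Linear probes are a common choice for this kind of auxiliary supervision: keeping the probe weak forces the factor to be encoded in the head's own features, which is consistent with the decoupled, high-quality features the paper reports.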

Why it matters

Existing VLA models struggle to generalize because they overfit to spurious correlations such as visual shortcuts and environmental noise. GuidedVLA offers a way to build more robust and general VLA models by giving the action decoder explicit guidance toward task-relevant factors. The mechanism yields high-quality, decoupled features, pointing toward more reliable robot learning.

Original Abstract

Vision-Language-Action (VLA) models aim for general robot learning by aligning action as a modality within powerful Vision-Language Models (VLMs). Existing VLAs rely on end-to-end supervision to implicitly enable the action decoding process to learn task-relevant features. However, without explicit guidance, these models often overfit to spurious correlations, such as visual shortcuts or environmental noise, limiting their generalization. In this paper, we introduce GuidedVLA, a framework designed to manually guide the action generation to focus on task-relevant factors. Our core insight is to treat the action decoder not as a monolithic learner, but as an assembly of functional components. Individual attention heads are supervised by manually defined auxiliary signals to capture distinct factors. As an initial study, we instantiate this paradigm with three specialized heads: object grounding, spatial geometry, and temporal skill logic. Across simulation and real-robot experiments, GuidedVLA improves success rates in both in-domain and out-of-domain settings compared to strong VLA baselines. Finally, we show that the quality of these specialized factors correlates positively with task performance and that our mechanism yields decoupled, high-quality features. Our results suggest that explicitly guiding action-decoder learning is a promising direction for building more robust and general VLA models.
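
The "explicit guidance" the abstract contrasts with end-to-end supervision then amounts to adding these head-level terms to the usual imitation loss. Below is a hypothetical training step continuing the sketch above; the tensor shapes, the action head, the dummy targets, and the 0.1 loss weights are all illustrative assumptions, not values from the paper.

```python
# Hypothetical training step continuing the sketch above.
B, Q, K, n_skills = 4, 8, 196, 10
decoder_attn = GuidedCrossAttention(d_model=512, n_heads=8)
action_head = nn.Linear(512, 7)                    # e.g. one 7-DoF action per query
spatial_probe, skill_probe = nn.Linear(64, 3), nn.Linear(64, n_skills)

action_queries = torch.randn(B, Q, 512)            # learned action queries
vlm_tokens = torch.randn(B, K, 512)                # VLM visual/text tokens
expert_actions = torch.randn(B, Q, 7)              # demonstration actions
object_mask = torch.zeros(B, K)                    # dummy grounding annotation
object_mask[:, :5] = 1.0
target_xyz = torch.randn(B, 3)
skill_id = torch.randint(0, n_skills, (B,))

out, attn, ctx = decoder_attn(action_queries, vlm_tokens)
l_action = F.mse_loss(action_head(out), expert_actions)  # usual end-to-end term
l_ground, l_spatial, l_skill = guided_losses(
    attn, ctx, spatial_probe, skill_probe, object_mask, target_xyz, skill_id)
loss = l_action + 0.1 * (l_ground + l_spatial + l_skill)
loss.backward()
```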
