ArXiv TLDR

VEGA: Visual Encoder Grounding Alignment for Spatially-Aware Vision-Language-Action Models

arXiv: 2605.10485

Hao Wang, Xiaobao Wei, Jingyang He, Chengyu Bai, Chun-Kai Fan + 8 more

cs.RO

TLDR

VEGA enhances VLA models' spatial reasoning by directly aligning their visual encoder outputs with 3D-aware features, improving robotic manipulation.

Key contributions

  • Introduces VEGA, aligning VLA visual encoders with 3D-aware DINOv2-FiT3D features for better spatial reasoning.
  • Performs spatial grounding at the visual encoder output, preventing entanglement with linguistic semantics.
  • Employs a lightweight projector for alignment, trained alongside the action objective and discarded at inference for zero added overhead (a minimal sketch follows this list).
  • Establishes a new state-of-the-art among implicit spatial grounding methods for vision-language-action models.
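
To make the mechanism concrete, here is a minimal PyTorch sketch of an encoder-level alignment objective in the spirit of VEGA: a lightweight projector maps the VLA visual encoder's patch tokens into the DINOv2-FiT3D feature space, and a cosine-similarity loss on those projected tokens is added to the standard action-prediction loss. The dimensions, module and function names, and the loss weight `lambda_align` are illustrative assumptions rather than the authors' exact configuration; the projector is dropped at inference, so deployment cost is unchanged.

```python
# Illustrative sketch of an encoder-level alignment objective (not the
# paper's exact implementation). Dimensions and the loss weight are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentProjector(nn.Module):
    """Lightweight projector mapping VLA visual-encoder tokens to the
    DINOv2-FiT3D feature space. Used only during training and discarded
    at inference, so it adds no deployment cost."""
    def __init__(self, vla_dim: int = 1024, fit3d_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vla_dim, vla_dim),
            nn.GELU(),
            nn.Linear(vla_dim, fit3d_dim),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(tokens)

def vega_style_loss(action_loss, vla_tokens, fit3d_tokens, projector,
                    lambda_align: float = 0.5):
    """Combine the standard action-prediction loss with a cosine-similarity
    alignment loss computed at the visual-encoder output (before the LLM).

    vla_tokens:   (B, N, vla_dim)   patch tokens from the VLA visual encoder
    fit3d_tokens: (B, N, fit3d_dim) frozen DINOv2-FiT3D features (targets)
    """
    aligned = projector(vla_tokens)
    # 1 - cosine similarity, averaged over all patch tokens; the targets are
    # detached so gradients only flow into the VLA encoder and the projector.
    align_loss = (1.0 - F.cosine_similarity(aligned, fit3d_tokens.detach(), dim=-1)).mean()
    return action_loss + lambda_align * align_loss
```

Because the alignment target comes from a frozen 3D-aware model, only the VLA encoder and the projector receive gradients from the alignment term.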

Why it matters

Robotic manipulation requires precise spatial reasoning, which current VLA models often lack because their visual backbones are pretrained on 2D images without explicit 3D supervision. VEGA provides a principled, efficient way to inject 3D spatial awareness directly into the VLA visual encoder, improving performance on simulated and real-world manipulation tasks without adding any inference cost.

Original Abstract

Precise spatial reasoning is fundamental to robotic manipulation, yet the visual backbones of current vision-language-action (VLA) models are predominantly pretrained on 2D image data without explicit 3D geometric supervision, resulting in representations that lack accurate spatial awareness. Existing implicit spatial grounding methods partially address this by aligning VLA features with those of 3D-aware foundation models, but they rely on empirical layer search and perform alignment on LLM-level visual tokens where spatial structure has already been entangled with linguistic semantics, limiting both generalizability and geometric interpretability. We propose VEGA (Visual Encoder Grounding Alignment), a simple yet effective framework that directly aligns the output of the VLA's visual encoder with spatially-aware features from DINOv2-FiT3D, a DINOv2 model fine-tuned with multi-view consistent 3D Gaussian Splatting supervision. By performing alignment at the visual encoder output level, VEGA grounds spatial awareness before any linguistic entanglement occurs, offering a more interpretable and principled alignment target. The alignment is implemented via a lightweight projector trained with a cosine similarity loss alongside the standard action prediction objective, and is discarded at inference time, introducing no additional computational overhead. Extensive experiments on simulation benchmarks and real-world manipulation tasks demonstrate that VEGA consistently outperforms existing implicit spatial grounding baselines, establishing a new state-of-the-art among implicit spatial grounding methods for VLA models.
