SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
Zewei Zhou, Ruining Yang, Xuewei Qi, Yiluan Guo + 6 more
TLDR
SpanVLA is an autonomous driving framework that combines efficient action planning with learning from negative-recovery samples to improve robustness and reduce latency.
Key contributions
- Integrates autoregressive reasoning with a flow-matching action expert for efficient end-to-end autonomous driving.
- Introduces an efficient bridge for VLM guidance to plan trajectories using a flow-matching policy, reducing inference time.
- Proposes a GRPO-based post-training method to learn from both positive and negative-recovery driving samples.
- Introduces mReasoning, a new real-world driving dataset with complex, reasoning-demanding, and negative-recovery scenarios.
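The flow-matching action expert in the first two contributions can be illustrated with a minimal sketch. The shapes, the `interpolate` and `flow_matching_loss` helpers, and the way history seeds the starting point are our assumptions for illustration, not the paper's actual implementation: conditional flow matching trains a velocity field to carry a noisy initial trajectory toward the ground-truth future trajectory along a straight-line path, and initializing from the historical trajectory (rather than pure noise) is what lets inference use few denoising steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a planned trajectory is 8 waypoints of (x, y).
HORIZON, DIM = 8, 2

def interpolate(a0, a1, t):
    """Point on the straight-line path from start a0 to target a1 at time t in [0, 1]."""
    return (1.0 - t) * a0 + t * a1

def flow_matching_loss(v_pred, a0, a1):
    """Conditional flow-matching regression target is the constant velocity a1 - a0."""
    target = a1 - a0
    return float(np.mean((v_pred - target) ** 2))

# Toy "historical trajectory initialization": seed a0 from the past trajectory
# plus small noise instead of pure Gaussian noise (our reading of the paper's idea).
history = np.zeros((HORIZON, DIM))              # stand-in for the observed past trajectory
a0 = history + 0.1 * rng.standard_normal((HORIZON, DIM))
a1 = rng.standard_normal((HORIZON, DIM))        # ground-truth future trajectory
t = rng.uniform()
a_t = interpolate(a0, a1, t)                    # noisy sample the network would see

# Sanity check: a perfect velocity predictor incurs zero loss.
assert flow_matching_loss(a1 - a0, a0, a1) == 0.0
```

In the full model, the velocity predictor would additionally be conditioned on the VLM's vision and reasoning features; here it is left abstract to keep the sketch self-contained.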
Why it matters
Existing VLA models suffer from high action-generation latency and limited robustness. SpanVLA addresses both issues by accelerating action generation and by learning from negative and recovery behaviors, improving the safety and efficiency of autonomous driving.
Original Abstract
Vision-Language-Action (VLA) models offer a promising autonomous driving paradigm for leveraging world knowledge and reasoning capabilities, especially in long-tail scenarios. However, existing VLA models often struggle with high latency in action generation under an autoregressive generation framework and exhibit limited robustness. In this paper, we propose SpanVLA, a novel end-to-end autonomous driving framework integrating an autoregressive reasoning module and a flow-matching action expert. First, SpanVLA introduces an efficient bridge that leverages the vision and reasoning guidance of the VLM to plan future trajectories using a flow-matching policy conditioned on historical trajectory initialization, which significantly reduces inference time. Second, to further improve the performance and robustness of SpanVLA, we propose a GRPO-based post-training method that enables the VLA model not only to learn from positive driving samples but also to avoid typical negative behaviors and learn recovery behaviors. We further introduce mReasoning, a new real-world driving reasoning dataset focusing on complex, reasoning-demanding scenarios and negative-recovery samples. Extensive experiments on the NAVSIM benchmarks (v1 and v2) demonstrate the competitive performance of SpanVLA. Additionally, qualitative results across diverse scenarios highlight the planning performance and robustness of our model.
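The GRPO-based post-training described above can be sketched at the level of its advantage computation. The reward values and the helper name below are hypothetical; the core mechanic of GRPO is that each sampled driving rollout is scored relative to its own group, so negative behaviors in the group receive negative advantages (pushing the policy away from them) while recovery behaviors receive positive ones:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardize rewards within one sampled group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Toy group of four rollouts for one scenario: two positive samples,
# one typical negative behavior, and one recovery behavior (illustrative scores).
rewards = [1.0, 0.8, -1.0, 0.5]
adv = grpo_advantages(rewards)

# The negative behavior gets the lowest advantage; advantages are zero-mean,
# so the policy-gradient update down-weights it relative to the group.
assert adv.argmin() == 2
assert abs(adv.mean()) < 1e-9
```

In the actual post-training, these advantages would weight a clipped policy-gradient objective over the VLA model's action log-probabilities; this sketch only shows how positive and negative-recovery samples enter the same group-relative update.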