StarVLA-$\alpha$: Reducing Complexity in Vision-Language-Action Systems
Jinhui Ye, Ning Gao, Senqiao Yang, Jinliang Zheng, Zixuan Wang + 5 more
TLDR
StarVLA-$\alpha$ is a deliberately simple Vision-Language-Action baseline that strips away architectural and pipeline complexity while remaining competitive across diverse benchmarks.
Key contributions
- Introduces StarVLA-$\alpha$, a simple yet strong baseline for Vision-Language-Action (VLA) systems.
- Deliberately minimizes architectural and pipeline complexity to enable systematic VLA design analysis.
- Re-evaluates key design axes including action modeling, robot pretraining, and interface engineering.
- Achieves competitive performance across four unified benchmarks and outperforms $\pi_{0.5}$ by 20% on the public real-world RoboChallenge benchmark.
Why it matters
The VLA landscape is fragmented: approaches differ in architecture, training data, embodiment configuration, and benchmark-specific engineering, which makes individual design choices hard to compare. StarVLA-$\alpha$ offers a simple, strong baseline for studying those choices systematically. It shows that a strong VLM backbone with minimal additional design already achieves competitive performance, reducing the need for complex architectures and engineering tricks and giving future research a clear starting point.
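The digest only states the architectural thesis (a strong VLM backbone plus a minimal action interface), so below is a small, hypothetical PyTorch sketch of what such a baseline could look like. `MinimalVLAPolicy`, the mean-pooled readout, the `chunk_len`/`action_dim` parameters, and the toy transformer standing in for the VLM are all illustrative assumptions, not StarVLA-$\alpha$'s actual implementation.

```python
import torch
import torch.nn as nn


class MinimalVLAPolicy(nn.Module):
    """Hypothetical sketch: a VLM backbone followed by a small action head.

    `backbone` stands in for any pretrained vision-language model that maps
    embedded (image, instruction) tokens to a sequence of hidden states.
    """

    def __init__(self, backbone: nn.Module, hidden_dim: int,
                 action_dim: int = 7, chunk_len: int = 8):
        super().__init__()
        self.backbone = backbone
        # Plain MLP head regressing a short chunk of continuous actions;
        # chunked regression is one of several action-modeling strategies
        # a VLA system might use.
        self.action_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, chunk_len * action_dim),
        )
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, hidden_dim), already-embedded multimodal input.
        h = self.backbone(tokens)       # (batch, seq, hidden_dim)
        pooled = h.mean(dim=1)          # simple pooling over the sequence
        out = self.action_head(pooled)  # (batch, chunk_len * action_dim)
        return out.view(-1, self.chunk_len, self.action_dim)


# Toy stand-in for a VLM backbone; a real system would load a pretrained model.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)
policy = MinimalVLAPolicy(backbone, hidden_dim=256)
actions = policy(torch.randn(2, 32, 256))  # -> shape (2, 8, 7)
```

A chunked continuous-action head is only one option along the action-modeling axis the paper re-evaluates; discretized action tokens or a diffusion head would slot into the same interface.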
Original Abstract
Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for building general-purpose robotic agents. However, the VLA landscape remains highly fragmented and complex, as existing approaches vary substantially in architectures, training data, embodiment configurations, and benchmark-specific engineering. In this work, we introduce StarVLA-$\alpha$, a simple yet strong baseline designed to study VLA design choices under controlled conditions. StarVLA-$\alpha$ deliberately minimizes architectural and pipeline complexity to reduce experimental confounders and enable systematic analysis. Specifically, we re-evaluate several key design axes, including action modeling strategies, robot-specific pretraining, and interface engineering. Across unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same simple baseline remains highly competitive, indicating that a strong VLM backbone combined with minimal design is already sufficient to achieve strong performance without relying on additional architectural complexity or engineering tricks. Notably, our single generalist model outperforms $\pi_{0.5}$ by 20% on the public real-world RoboChallenge benchmark. We expect StarVLA-$\alpha$ to serve as a solid starting point for future research in the VLA regime. Code will be released at https://github.com/starVLA/starVLA.