StarVLA-$\alpha$: Reducing Complexity in Vision-Language-Action Systems
Jinhui Ye, Ning Gao, Senqiao Yang, Jinliang Zheng, Zixuan Wang + 5 more
TLDR
StarVLA-$\alpha$ is a deliberately simple Vision-Language-Action baseline that strips away architectural and pipeline complexity while remaining competitive across diverse benchmarks.
Key contributions
- Introduces StarVLA-$\alpha$, a simple yet strong baseline for Vision-Language-Action (VLA) systems.
- Deliberately minimizes architectural and pipeline complexity to enable systematic VLA design analysis.
- Re-evaluates key design axes including action modeling, robot pretraining, and interface engineering.
- Achieves competitive performance across four unified benchmarks and outperforms $\pi_{0.5}$ by 20% on the public real-world RoboChallenge benchmark.
Why it matters
The VLA landscape is fragmented: approaches differ in architecture, training data, embodiment configuration, and benchmark-specific engineering, which makes individual design choices hard to compare. StarVLA-$\alpha$ offers a simple, strong baseline for studying those choices systematically. It shows that a strong VLM backbone with minimal additional design already achieves competitive performance, reducing the need for complex architectures and engineering tricks and giving future research a clear starting point.
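The digest only states the architectural thesis (a strong VLM backbone plus a minimal action interface), so below is a small, hypothetical PyTorch sketch of what such a baseline could look like. `MinimalVLAPolicy`, the mean-pooled readout, the `chunk_len`/`action_dim` parameters, and the toy transformer standing in for the VLM are all illustrative assumptions, not StarVLA-$\alpha$'s actual implementation.

```python
import torch
import torch.nn as nn


class MinimalVLAPolicy(nn.Module):
    """Hypothetical sketch: a VLM backbone followed by a small action head.

    `backbone` stands in for any pretrained vision-language model that maps
    embedded (image, instruction) tokens to a sequence of hidden states.
    """

    def __init__(self, backbone: nn.Module, hidden_dim: int,
                 action_dim: int = 7, chunk_len: int = 8):
        super().__init__()
        self.backbone = backbone
        # Plain MLP head regressing a short chunk of continuous actions;
        # chunked regression is one of several action-modeling strategies
        # a VLA system might use.
        self.action_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, chunk_len * action_dim),
        )
        self.chunk_len, self.action_dim = chunk_len, action_dim

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, hidden_dim), already-embedded multimodal input.
        h = self.backbone(tokens)       # (batch, seq, hidden_dim)
        pooled = h.mean(dim=1)          # simple pooling over the sequence
        out = self.action_head(pooled)  # (batch, chunk_len * action_dim)
        return out.view(-1, self.chunk_len, self.action_dim)


# Toy stand-in for a VLM backbone; a real system would load a pretrained model.
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)
policy = MinimalVLAPolicy(backbone, hidden_dim=256)
actions = policy(torch.randn(2, 32, 256))  # -> shape (2, 8, 7)
```

A chunked continuous-action head is only one option along the action-modeling axis the paper re-evaluates; discretized action tokens or a diffusion head would slot into the same interface.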
Original Abstract
Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for building general-purpose robotic agents. However, the VLA landscape remains highly fragmented and complex, as existing approaches vary substantially in architectures, training data, embodiment configurations, and benchmark-specific engineering. In this work, we introduce StarVLA-$\alpha$, a simple yet strong baseline designed to study VLA design choices under controlled conditions. StarVLA-$\alpha$ deliberately minimizes architectural and pipeline complexity to reduce experimental confounders and enable systematic analysis. Specifically, we re-evaluate several key design axes, including action modeling strategies, robot-specific pretraining, and interface engineering. Across unified multi-benchmark training on LIBERO, SimplerEnv, RoboTwin, and RoboCasa, the same simple baseline remains highly competitive, indicating that a strong VLM backbone combined with minimal design is already sufficient to achieve strong performance without relying on additional architectural complexity or engineering tricks. Notably, our single generalist model outperforms $\pi_{0.5}$ by 20% on the public real-world RoboChallenge benchmark. We expect StarVLA-$\alpha$ to serve as a solid starting point for future research in the VLA regime. Code will be released at https://github.com/starVLA/starVLA.