PokeVLA: Empowering Pocket-Sized Vision-Language-Action Model with Comprehensive World Knowledge Guidance
Yupeng Zheng, Xiang Li, Songen Gu, Yuhang Zheng, Shuai Tian, and 10 more authors
TLDR
PokeVLA is a lightweight Vision-Language-Action model that improves robot manipulation by integrating comprehensive world knowledge and spatial awareness.
Key contributions
- Introduces PokeVLA, a lightweight Vision-Language-Action model for efficient embodied robot manipulation.
- Employs a two-stage training paradigm: first, pre-trains the compact vision-language model PokeVLM on 2.4M multimodal samples covering spatial grounding, affordance, and embodied reasoning.
- Second, injects manipulation-relevant representations into the action space via multi-view goal-aware semantics learning, geometry alignment, and a novel action expert (see the sketch after this list).
- Achieves state-of-the-art performance on the LIBERO-Plus benchmark and robust performance in real-world robot deployment.
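The two stages compose into a single pipeline: a pre-trained vision-language backbone followed by an action head trained on robot data. Below is a minimal, illustrative PyTorch sketch of that structure; every module name, dimension, and the pooling/MSE objective are assumptions made for clarity, not the authors' released implementation.

```python
# Illustrative sketch of the two-stage structure. All names, dimensions,
# and losses here are hypothetical placeholders, not the paper's code.
import torch
import torch.nn as nn

class PokeVLM(nn.Module):
    """Stand-in for the compact vision-language backbone (hypothetical)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.vision = nn.Linear(768, dim)      # placeholder vision projector
        self.text = nn.Embedding(32000, dim)   # placeholder text embedding
        self.fuse = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, image_feats, token_ids):
        # Concatenate visual and language tokens, then fuse them jointly.
        tokens = torch.cat([self.vision(image_feats), self.text(token_ids)], dim=1)
        return self.fuse(tokens)

class ActionExpert(nn.Module):
    """Hypothetical action head mapping fused features to an action chunk."""
    def __init__(self, dim: int = 512, action_dim: int = 7, horizon: int = 8):
        super().__init__()
        self.head = nn.Linear(dim, action_dim * horizon)
        self.action_dim, self.horizon = action_dim, horizon

    def forward(self, fused):
        pooled = fused.mean(dim=1)  # pool over the token sequence
        return self.head(pooled).view(-1, self.horizon, self.action_dim)

# Stage 1: pre-train PokeVLM on multimodal reasoning data (objective omitted).
vlm = PokeVLM()

# Stage 2: train the action expert on robot trajectories, injecting
# manipulation-relevant representations from the pre-trained backbone.
expert = ActionExpert()
image_feats = torch.randn(2, 16, 768)         # dummy multi-view patch features
token_ids = torch.randint(0, 32000, (2, 12))  # dummy instruction tokens
target_actions = torch.randn(2, 8, 7)         # dummy action-chunk labels

pred = expert(vlm(image_feats, token_ids))
loss = nn.functional.mse_loss(pred, target_actions)
loss.backward()
```

The intuition behind the split is that stage 1 endows the backbone with world knowledge before any robot data is seen, so stage 2 only has to learn the mapping from already-grounded representations to actions.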
Why it matters
By pairing a compact model with broad world knowledge and spatial awareness, PokeVLA targets two key limitations of current VLA models: limited efficiency and weak high-level reasoning. It improves manipulation success and robustness under diverse perturbations, both in simulation and in real-world deployment.
Original Abstract
Recent advances in Vision-Language-Action (VLA) models have opened new avenues for robot manipulation, yet existing methods exhibit limited efficiency and a lack of high-level knowledge and spatial awareness. To address these challenges, we propose PokeVLA, a lightweight yet powerful foundation model for embodied manipulation that effectively infuses vision-language understanding into action learning. Our framework introduces a two-stage training paradigm: first, we pre-train a compact vision-language model (PokeVLM) on a curated multimodal dataset of 2.4M samples encompassing spatial grounding, affordance, and embodied reasoning tasks; second, we inject manipulation-relevant representations into the action space through multi-view goal-aware semantics learning, geometry alignment, and a novel action expert. Extensive experiments demonstrate state-of-the-art performance on the LIBERO-Plus benchmark and in real-world deployment, outperforming comparable baselines in success rate and robustness under diverse perturbations. To foster reproducibility and community progress, we will open-source our code, model weights, and the scripts for the curated pre-training dataset. Project page: https://getterupper.github.io/PokeVLA
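The abstract names geometry alignment as one of the stage-2 objectives without specifying its form. A common way to realize such an alignment is an InfoNCE-style contrastive loss between pooled vision-language features and geometric features (e.g., from depth or point clouds); the sketch below is that generic formulation, offered only as a plausible reading, with all names (`geometry_alignment_loss`, `vlm_feats`, `geom_feats`) hypothetical.

```python
# Hypothetical geometry-alignment objective; the paper's actual loss may differ.
import torch
import torch.nn.functional as F

def geometry_alignment_loss(vlm_feats, geom_feats, temperature=0.07):
    """InfoNCE-style alignment between VLM tokens and geometric features.

    vlm_feats:  (B, T, D) fused vision-language tokens
    geom_feats: (B, N, D) geometric feature tokens for the same scenes
    """
    a = F.normalize(vlm_feats.mean(dim=1), dim=-1)   # (B, D) pooled VLM features
    b = F.normalize(geom_feats.mean(dim=1), dim=-1)  # (B, D) pooled geometry features
    logits = a @ b.t() / temperature                 # (B, B) pairwise similarities
    labels = torch.arange(a.size(0), device=a.device)
    # Symmetric contrastive loss: each sample should match its own pair.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Usage with dummy tensors:
loss = geometry_alignment_loss(torch.randn(4, 28, 512), torch.randn(4, 64, 512))
```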