ArXiv TLDR

ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

arXiv:2605.05126

Wei Li, Jizhihui Liu, Li Yixing, Junwen Tong, Rui Shao + 1 more

cs.RO

TLDR

ConsisVLA-4D improves robotic manipulation by enforcing spatiotemporal consistency across 3D perception and 4D reasoning, delivering both higher success rates and 2.3-2.4x faster inference than OpenVLA.

Key contributions

  • CV-Aligner ensures cross-view object semantic consistency by aligning instruction-relevant regions and object identities.
  • CO-Fuser guarantees cross-object spatial geometric consistency, resolving ambiguities with compact latent representations.
  • CS-Thinker achieves cross-scene spatiotemporal consistency, learning local dynamics from CV-Aligner's object-semantic tokens and global depth from CO-Fuser's geometric tokens for efficient visual reasoning (a rough dataflow sketch follows this list).
  • Achieves 21.6% and 41.5% performance gains with 2.3x and 2.4x inference speedups over OpenVLA on the LIBERO benchmark and real-world platforms, respectively.
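
The bullets above describe an architecture, not code. As a rough mental model only, here is a minimal PyTorch-style sketch of the dataflow. The class names render the paper's module names, but every internal (the attention-based filtering, the learned latent queries, the dimensions, and the action head) is an assumption, not the authors' implementation.

```python
# Hypothetical sketch only: module internals, dimensions, and the use of
# multi-head attention are assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class CVAligner(nn.Module):
    """Cross-view semantic consistency: filter instruction-relevant regions,
    then align object identities across views (assumed attention design)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.filter = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.align = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, view_tokens, instr_tokens):
        # view_tokens: (B, V*N, D) tokens from V camera views; instr_tokens: (B, L, D)
        relevant, _ = self.filter(view_tokens, instr_tokens, instr_tokens)
        aligned, _ = self.align(relevant, relevant, relevant)
        return aligned  # object-semantic tokens

class COFuser(nn.Module):
    """Cross-object geometric consistency: fuse relations across views into
    a small set of learned latent tokens (assumed Perceiver-style design)."""
    def __init__(self, dim=256, heads=4, n_latents=16):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))
        self.fuse = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, semantic_tokens):
        q = self.latents.unsqueeze(0).expand(semantic_tokens.shape[0], -1, -1)
        geo, _ = self.fuse(q, semantic_tokens, semantic_tokens)
        return geo  # compact geometric tokens

class CSThinker(nn.Module):
    """Cross-scene reasoning over semantic (local dynamics) and geometric
    (global depth) tokens, ending in an action head (assumed 7-DoF)."""
    def __init__(self, dim=256, heads=4, action_dim=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.reason = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, action_dim)

    def forward(self, semantic_tokens, geometric_tokens):
        fused = self.reason(torch.cat([semantic_tokens, geometric_tokens], dim=1))
        return self.head(fused.mean(dim=1))  # (B, action_dim)
```

The small set of latent queries in COFuser mirrors the abstract's claim that cross-object geometry is compressed into compact latent representations, which is plausibly where the inference speedup comes from.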

Why it matters

Current VLA models struggle with spatiotemporal perception and reasoning: they map mostly 2D observations to actions, and adding spatial sensors introduces substantial computational overhead. ConsisVLA-4D offers an efficient, unified framework that overcomes these limitations, delivering large performance and speed improvements in robotic manipulation tasks.

Original Abstract

Current Vision-Language-Action (VLA) models primarily focus on mapping 2D observations to actions, but exhibit notable limitations in spatiotemporal perception and reasoning: 1) spatial representations often rely on additional sensors, introducing substantial computational overhead; 2) visual reasoning is typically limited to future-frame prediction, lacking alignment with the instruction-grounded scene and thus compromising spatiotemporal consistency. To address these challenges, we propose ConsisVLA-4D, a unified and efficient framework that enhances spatiotemporal consistency in 3D perception and 4D reasoning. Specifically, we design: 1) CV-Aligner, which ensures cross-view object semantic consistency by filtering instruction-relevant regions and aligning object identities across multiple viewpoints; 2) CO-Fuser, which guarantees cross-object spatial geometric consistency by eliminating spatial relation ambiguities between objects across views using compact latent representations. Building upon these, we introduce 3) CS-Thinker to achieve cross-scene spatiotemporal consistency as actions unfold. It learns implicit knowledge of local dynamics from object-semantic tokens of CV-Aligner and global depth from geometric tokens of CO-Fuser, thereby enhancing efficient visual reasoning under scene variations. Extensive experiments demonstrate that, benefiting from its efficient spatiotemporal consistency design, ConsisVLA-4D achieves 21.6% and 41.5% performance improvements, along with 2.3-fold and 2.4-fold inference speedups compared to OpenVLA on the LIBERO benchmark and real-world platforms, respectively. ConsisVLA-4D is open-sourced and publicly available at
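
To make "cross-scene spatiotemporal consistency as actions unfold" concrete, here is a hedged closed-loop rollout reusing the hypothetical modules sketched above. The horizon, the geometric-history window, and the random stand-ins for the observation and instruction encoders are all placeholders, not details from the paper.

```python
# Hypothetical closed-loop rollout built on the sketch above; the horizon,
# history window, and random "encoders" are placeholders, not from the paper.
import torch

B, V, N, L, D = 1, 2, 64, 8, 256        # batch, views, tokens per view, instr length, dim
aligner, fuser, thinker = CVAligner(D), COFuser(D), CSThinker(D)
instr = torch.randn(B, L, D)             # stand-in for encoded instruction tokens
obs = torch.randn(B, V * N, D)           # stand-in for encoded multi-view observation

history = []                             # geometric tokens carried across timesteps
with torch.no_grad():
    for t in range(10):                  # assumed horizon
        sem = aligner(obs, instr)        # cross-view object-semantic tokens
        geo = fuser(sem)                 # compact cross-object geometric tokens
        history.append(geo)
        geo_ctx = torch.cat(history[-4:], dim=1)  # assumed 4-step context window
        action = thinker(sem, geo_ctx)   # (B, 7) action for this step
        obs = torch.randn(B, V * N, D)   # stub: next observation after executing action
```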
