ArXiv TLDR

Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models

2604.18000

Haiweng Xu, Sipeng Zheng, Hao Luo, Wanpeng Zhang, Ziheng Xi + 1 more

cs.RO

TLDR

This paper introduces BeTTER, a new benchmark revealing that state-of-the-art VLA models lack true embodied reasoning and fail in dynamic scenarios due to architectural flaws.

Key contributions

  • Introduces BeTTER, a diagnostic benchmark for testing true embodied reasoning in robotic policies.
  • Applies targeted causal interventions to decouple high-level reasoning failures from low-level execution limits.
  • Reveals state-of-the-art VLAs catastrophically fail in dynamic scenarios due to architectural bottlenecks.
  • Identifies lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse as key issues.

Why it matters

Current VLA benchmarks may not accurately assess true embodied reasoning, leading to an overestimation of model capabilities. This work provides a critical diagnostic tool and uncovers fundamental architectural limitations, pushing the field towards more robust and genuinely intelligent robotic systems.

Original Abstract

Recent Vision-Language-Action (VLA) models report impressive success rates on standard robotic benchmarks, fueling optimism about general-purpose physical intelligence. However, recent evidence suggests a systematic misalignment between standard benchmark success and true embodied reasoning, raising the question of whether these high scores reflect genuine cognitive capability. To address this gap, we introduce BeTTER, a diagnostic Benchmark for Testing True Embodied Reasoning in robotic policies. BeTTER applies targeted causal interventions (e.g., spatial layout shifts, temporal extrapolation) while enforcing kinematic isolation to explicitly decouple high-level reasoning failures from low-level execution limits. Through systematic evaluation, we reveal that state-of-the-art VLAs catastrophically fail in dynamic scenarios, exhibiting severe lexical-kinematic shortcuts, behavioral inertia, and semantic feature collapse. Crucially, our mechanistic analysis traces these symptoms to fundamental architectural bottlenecks - such as capacity compression and myopic downsampling - which systematically degrade the model's foundational semantic representation. We demonstrate that highly static evaluation protocols effectively mask this degradation by allowing optimization to overfit to sensorimotor priors. Supported by real-world robotic validation, our findings confirm that this representational breakdown is not a simulation artifact, highlighting the critical need for future VLA paradigms to resolve the structural tension between high-frequency control and high-level reasoning.
