The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling
TLDR
Discrete action tokenization creates a "Compression Gap" in VLA models, limiting scaling benefits from improved vision encoders.
Key contributions
- Introduces the "Compression Gap" principle: VLA model scaling is limited by the tightest information bottleneck.
- Shows that discrete action tokenization (e.g., OAT) makes the fixed-capacity codebook the bottleneck, nullifying vision encoder upgrades (see the sketch after this list).
- Demonstrates continuous actions (e.g., Diffusion Policy) allow vision encoder scaling to improve performance.
- Validates findings on LIBERO, showing codebook capacity directly impacts encoder sensitivity.
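To make the bottleneck concrete, here is a minimal, hypothetical sketch of a fixed-capacity action codebook. It is not the paper's OAT implementation; the codebook size K = 256 and the 7-D action space are assumptions made purely for illustration.

```python
# Hypothetical sketch of a fixed-capacity action codebook (not OAT itself):
# continuous actions are snapped to the nearest of K code vectors, so each
# action token carries at most log2(K) bits, no matter how rich the
# upstream vision features are.
import numpy as np

rng = np.random.default_rng(0)
K, action_dim = 256, 7                       # assumed codebook size / action dims
codebook = rng.normal(size=(K, action_dim))  # stand-in for learned code vectors

def tokenize(action: np.ndarray) -> int:
    """Map a continuous action to the index of its nearest code vector."""
    return int(np.argmin(np.linalg.norm(codebook - action, axis=1)))

def detokenize(token: int) -> np.ndarray:
    """Decode a token back to its code vector (lossy reconstruction)."""
    return codebook[token]

action = rng.normal(size=action_dim)         # e.g. end-effector deltas + gripper
token = tokenize(action)
reconstructed = detokenize(token)

print(f"token={token}, capacity <= log2(K) = {np.log2(K):.1f} bits per token")
print("reconstruction error:", np.linalg.norm(action - reconstructed))
```

However expressive the vision features, the policy's output must pass through one of the K code vectors, so upgrading the encoder cannot add information beyond what the codebook can express.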
Why it matters
This paper challenges the assumption that upgrading the vision encoder alone scales Vision-Language-Action models. It highlights the critical role of information bottlenecks, particularly discrete action representations, in capping performance gains, and it shifts the focus from uniformly increasing model or data size to identifying where the pipeline's tightest bottleneck lies for effective Physical AI scaling.
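The "tightest bottleneck" claim can be stated roughly with the data processing inequality; the notation below is mine, not quoted from the paper:

```latex
% Rough formalization (assumed notation): model the pipeline as a Markov chain
%   O (observation) -> Z (vision features) -> T (action token) -> A (action).
% The data processing inequality caps task-relevant information at the
% tightest link:
\[
  I(O; A) \;\le\; \min\bigl\{\, I(O; Z),\; I(Z; T),\; I(T; A) \,\bigr\}.
\]
% With a fixed codebook of size $K$, the discrete token $T$ satisfies
\[
  I(Z; T) \;\le\; H(T) \;\le\; \log_2 K \ \text{bits per token},
\]
% so raising $I(O; Z)$ by upgrading the encoder cannot raise $I(O; A)$
% once $\log_2 K$ is the binding constraint.
```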
Original Abstract
Scaling Vision-Language-Action (VLA) models by upgrading the vision encoder is expected to improve downstream manipulation performance--as it does in vision-language modeling. We show that this expectation fails when actions are represented as discrete tokens, and explain why through an information-theoretic principle we call the Compression Gap: in any visuomotor pipeline, scaling behavior is governed by the location of the tightest information bottleneck. When actions are continuous (e.g., Diffusion Policy), the vision encoder is the binding constraint, and upgrading it directly improves performance. When actions are discretized through a fixed-capacity codebook (e.g., OAT), the codebook becomes the binding constraint, and encoder improvements cannot propagate past it--regardless of how rich the upstream representation is. We validate this principle on the LIBERO benchmark with three lines of evidence: a factorial experiment showing that encoder upgrades improve Diffusion Policy by over 21 percentage points while OAT gains are substantially attenuated across model scales; an encoder quality gradient across four encoders confirming that Diffusion Policy tracks encoder quality monotonically while OAT remains flat; and a codebook size experiment demonstrating that relaxing codebook capacity partially recovers encoder sensitivity, providing causal evidence for the bottleneck hypothesis. Our findings reveal that scaling in Physical AI requires identifying where information bottlenecks lie in the pipeline, rather than uniformly increasing model or data size.