ALAM: Algebraically Consistent Latent Transitions for Vision-Language-Action Models
Zuojin Tang, Haoyun Liu, Xinyuan Chang, Changjie Wu, Dongjie Huo + 9 more
TLDR
ALAM learns algebraically consistent latent transitions from action-free videos, significantly boosting VLA policy performance on complex robot manipulation tasks.
Key contributions
- Introduces ALAM, an Algebraic Latent Action Model, learning structured latent transitions from action-free videos.
- Uses composition and reversal consistency to regularize the latent space, encouraging locally additive transitions.
- Integrates structured latent transitions with flow-based policy generation via a joint flow-matching objective.
- Lifts average success from 47.9% to 85.0% on MetaWorld MT50 and from 94.1% to 98.1% on LIBERO, with consistent gains on real-world manipulation tasks.
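The composition and reversal constraints above can be illustrated with a toy sketch. The `transition` encoder, its linear weights `W`, and the loss names below are illustrative stand-ins, not the paper's actual architecture; ALAM learns the transition encoder from video and grounds it with reconstruction, which this sketch omits.

```python
import numpy as np

# Hypothetical linear "transition encoder": maps a frame pair to a latent
# transition vector. In ALAM this is a learned network; the toy difference
# form here is exactly additive, so both consistency losses come out ~0.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))

def transition(frame_a, frame_b):
    # Latent transition z(a -> b) for flattened toy "frames".
    return (frame_b - frame_a) @ W

def composition_loss(fa, fb, fc):
    # Composition consistency: z(a -> c) should equal z(a -> b) + z(b -> c).
    direct = transition(fa, fc)
    chained = transition(fa, fb) + transition(fb, fc)
    return float(np.mean((direct - chained) ** 2))

def reversal_loss(fa, fb):
    # Reversal consistency: z(b -> a) should equal -z(a -> b).
    return float(np.mean((transition(fb, fa) + transition(fa, fb)) ** 2))

fa, fb, fc = rng.normal(size=(3, 8))  # a toy frame triplet
print(composition_loss(fa, fb, fc))   # near zero for this additive encoder
print(reversal_loss(fa, fb))          # near zero
```

In training these two terms would be added as regularizers alongside the reconstruction loss, pushing a nonlinear encoder toward the locally additive geometry that the toy encoder has by construction.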
Why it matters
VLA models struggle with limited action-labeled data. ALAM addresses this by leveraging abundant action-free videos to learn robust, algebraically consistent latent action representations. This significantly improves robot policy generation, making VLA models more effective and data-efficient for complex manipulation tasks.
Original Abstract
Vision-language-action (VLA) models remain constrained by the scarcity of action-labeled robot data, whereas action-free videos provide abundant evidence of how the physical world changes. Latent action models offer a promising way to extract such priors from videos, but reconstruction-trained latent codes are not necessarily suitable for policy generation: they may predict future observations while lacking the structure needed to be reused or generated coherently with robot actions. We introduce ALAM (Algebraic Latent Action Model), an Algebraically Consistent Latent Action Model that turns temporal relations in action-free video into structural supervision. Given frame triplets, ALAM learns latent transitions that are grounded by reconstruction while being regularized by composition and reversal consistency, encouraging a locally additive transition space. For downstream VLA learning, we freeze the pretrained encoder and use its latent transition sequences as auxiliary generative targets, co-generated with robot actions under a joint flow-matching objective. This couples structured latent transitions with flow-based policy generation, allowing the policy to exploit ALAM's locally consistent transition geometry without requiring latent-to-action decoding. Representation probes show that ALAM reduces additivity and reversibility errors by 25-85 times over unstructured latent-action baselines and improves long-horizon cumulative reconstruction. When transferred to VLA policies, ALAM raises the average success rate from 47.9% to 85.0% on MetaWorld MT50 and from 94.1% to 98.1% on LIBERO, with consistent gains on real-world manipulation tasks. Ablations further confirm that the strongest improvements arise from the synergy between algebraically structured latent transitions and joint flow matching.
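The joint flow-matching objective described in the abstract can be sketched minimally. Assuming a standard linear-interpolant (rectified-flow) path, the latent transition sequence from the frozen ALAM encoder and the robot action chunk are concatenated into one joint target, and the policy regresses the velocity toward it; the array sizes and names below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

def joint_flow_matching_pair(noise, latent_seq, action_seq, t):
    # Joint target: frozen ALAM latent transitions co-generated with actions.
    target = np.concatenate([latent_seq, action_seq])
    # Linear interpolant between noise and target at flow time t in [0, 1].
    x_t = (1.0 - t) * noise + t * target
    # Velocity target for flow matching: constant along the straight path.
    v_star = target - noise
    return x_t, v_star

latents = rng.normal(size=6)   # toy latent transition sequence (frozen encoder)
actions = rng.normal(size=4)   # toy robot action chunk
noise = rng.normal(size=10)
t = 0.3
x_t, v_star = joint_flow_matching_pair(noise, latents, actions, t)
# A policy network would be trained to predict v_star from (x_t, t, context);
# following the predicted velocity from noise to t = 1 recovers the target.
```

Because the latents are only auxiliary generative targets, no latent-to-action decoder is needed at inference: the action slice of the generated joint vector is executed directly.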