ArXiv TLDR

From Pixels to Tokens: A Systematic Study of Latent Action Supervision for Vision-Language-Action Models

2605.04678

Yihan Lin, Haoyang Li, Yang Li, Haitao Shen, Yihan Zhao + 2 more

cs.RO · cs.CV

TLDR

This paper systematically compares latent action supervision methods for VLA models, finding that image-based latent actions aid long-horizon reasoning while action-based latent actions improve complex motor coordination.

Key contributions

  • Systematically compares image-based and action-based latent action supervision for VLA models under a unified baseline.
  • Reveals that image-based latent actions benefit long-horizon reasoning and scene-level generalization.
  • Shows that action-based latent actions excel at complex motor coordination, and that direct supervision with discrete latent action tokens is most effective.
  • Provides initial insights into the benefits of latent action supervision in mixed-data VLA training.

Why it matters

This work provides a much-needed systematic comparison of latent action supervision methods, clarifying which formulations suit different VLA tasks. Its findings on discrete token supervision and mixed-data training offer promising directions for developing more robust and versatile VLA models.

Original Abstract

Latent actions serve as an intermediate representation that enables consistent modeling of vision-language-action (VLA) models across heterogeneous datasets. However, approaches to supervising VLAs with latent actions are fragmented and lack a systematic comparison. This work structures the study of latent action supervision from two perspectives: (i) regularizing the trajectory via image-based latent actions, and (ii) unifying the target space with action-based latent actions. Under a unified VLA baseline, we instantiate and compare four representative integration strategies. Our results reveal a formulation-task correspondence: image-based latent actions benefit long-horizon reasoning and scene-level generalization, whereas action-based latent actions excel at complex motor coordination. Furthermore, we find that directly supervising the VLM with discrete latent action tokens yields the most effective performance. Finally, our experiments offer initial insights into the benefits of latent action supervision in mixed-data, suggesting a promising direction for VLA training. Code is available at https://github.com/RUCKBReasoning/From_Pixels_to_Tokens.
