ArXiv TLDR

VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

arXiv:2604.19728

Jean Mercat, Sedrick Keh, Kushal Arora, Isabella Huang, Paarth Shah + 3 more

cs.RO, cs.AI, cs.CV, cs.LG, cs.SE

TLDR

VLA Foundry is an open-source framework unifying LLM, VLM, and VLA training, offering end-to-end control from pretraining to fine-tuning.

Key contributions

  • Introduces VLA Foundry, a unified open-source framework for end-to-end LLM, VLM, and VLA model training.
  • Supports both from-scratch training and fine-tuning with pretrained Hugging Face backbones.
  • Releases two models, including one built on the Qwen3-VL backbone, achieving strong multi-task manipulation performance.
  • Contributes usability improvements to the LBM Eval simulator and STEP analysis tools.

Why it matters

This framework addresses the fragmentation in VLA model development, where action-stage training is often stitched onto incompatible pretraining pipelines, by offering a cohesive, end-to-end training stack. It lets researchers train VLA models from scratch or build on existing pretrained backbones, advancing open-source robotics.

Original Abstract

We present VLA Foundry, an open-source framework that unifies LLM, VLM, and VLA training in a single codebase. Most open-source VLA efforts specialize on the action training stage, often stitching together incompatible pretraining pipelines. VLA Foundry instead provides a shared training stack with end-to-end control, from language pretraining to action-expert fine-tuning. VLA Foundry supports both from-scratch training and pretrained backbones from Hugging Face. To demonstrate the utility of our framework, we train and release two types of models: the first trained fully from scratch through our LLM-->VLM-->VLA pipeline and the second built on the pretrained Qwen3-VL backbone. We evaluate closed-loop policy performance of both models on LBM Eval, an open-data, open-source simulator. We also contribute usability improvements to the simulator and the STEP analysis tools for easier public use. In the nominal evaluation setting, our fully-open from-scratch model is on par with our prior closed-source work and substituting in the Qwen3-VL backbone leads to a strong multi-task table top manipulation policy outperforming our baseline by a wide margin. The VLA Foundry codebase is available at https://github.com/TRI-ML/vla_foundry and all multi-task model weights are released on https://huggingface.co/collections/TRI-ML/vla-foundry. Additional qualitative videos are available on the project website https://tri-ml.github.io/vla_foundry.
