ArXiv TLDR

VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

arXiv:2605.06175

Yuhua Jiang, Junjie Lu, Xinyao Qin, Xiaoyu Chen, Kaixin Wang + 2 more

cs.RO

TLDR

VLA-GSE introduces a parameter-efficient fine-tuning framework that uses generalized and specialized experts to improve the adaptation of VLA models to robotic control tasks.

Key contributions

  • Proposes VLA-GSE, a PEFT framework for VLA models, improving adaptation to robotic control tasks.
  • Uses spectral decomposition to create generalized (shared) and specialized (routed) experts from the frozen backbone (see the sketch after this list).
  • Achieves 81.2% average zero-shot success on LIBERO-Plus, outperforming baselines while updating only 2.51% of the model's parameters.
  • Preserves pre-trained VLM capabilities and improves real-world manipulation under distribution shifts.
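
As a concrete illustration, here is a minimal PyTorch sketch of the spectral split described in the second bullet. The function split_experts and its arguments rank_g and num_specialized are illustrative names, not the authors' API; the exact rank budget and chunking follow the paper and its repository.

    import torch

    def split_experts(W: torch.Tensor, rank_g: int, num_specialized: int):
        """Split a frozen weight W (d_in x d_out, used as y = x @ W) into
        one generalized expert and several specialized experts via SVD."""
        # Detach: the backbone is frozen, so the factors become trainable leaves.
        U, S, Vh = torch.linalg.svd(W.detach(), full_matrices=False)

        def factors(idx: torch.Tensor):
            # Fold singular values into the left factor so W_part = A @ B.
            return U[:, idx] * S[idx], Vh[idx, :]

        # Generalized (shared) expert: the leading rank_g singular directions.
        generalized = factors(torch.arange(rank_g))

        # Specialized (routed) experts: disjoint chunks of the residual spectrum.
        residual_chunks = torch.arange(rank_g, S.numel()).chunk(num_specialized)
        specialized = [factors(idx) for idx in residual_chunks]
        return generalized, specialized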

Why it matters

This paper addresses the challenge of adapting large VLA models to robotic control without overfitting on downstream data or catastrophically forgetting pre-trained knowledge. VLA-GSE offers a parameter-efficient solution that boosts performance on complex manipulation tasks while preserving core VLM capabilities. Its spectral expert-decomposition initialization is a simple, reusable recipe for efficient VLA fine-tuning.

Original Abstract

Vision-language-action (VLA) models inherit rich visual-semantic priors from pre-trained vision-language backbones, but adapting them to robotic control remains challenging. Full fine-tuning (FFT) is prone to overfitting on downstream robotic data and catastrophic forgetting of pretrained vision-language capabilities. Parameter-efficient fine-tuning (PEFT) better preserves pre-trained knowledge, yet existing PEFT methods still struggle to adapt effectively to robot control tasks. To address this gap, we propose VLA-GSE, a parameter-efficient VLA fine-tuning framework that improves control adaptation while retaining PEFT's knowledge preservation advantage. Specifically, VLA-GSE (Generalized and Specialized Experts) is initialized by spectrally decomposing the frozen backbone, assigning leading singular components to generalized experts (shared experts) and disjoint residual components to specialized experts (routed experts). This decomposition improves adaptation capacity under a fixed trainable-parameter budget. Under a comparable parameter budget, VLA-GSE updates only 2.51% of the full model parameters and consistently outperforms strong FFT and PEFT baselines. It achieves 81.2% average zero-shot success on LIBERO-Plus, preserves pre-trained VLM capability comparably to LoRA on multimodal understanding benchmarks, and improves real-world manipulation success under multiple distribution shifts. Code is available at: https://github.com/YuhuaJiang2002/VLA-GSE
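
For intuition about how the shared and routed experts described in the abstract interact at inference time, the hypothetical layer below always applies the generalized expert and mixes the specialized experts with a softmax gate. The router design is an assumption for illustration, not the paper's confirmed routing mechanism; it reuses the (A, B) factor pairs from the sketch above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GSELayer(nn.Module):
        def __init__(self, generalized, specialized, d_in: int):
            super().__init__()
            self.gen_A = nn.Parameter(generalized[0])  # (d_in, rank_g)
            self.gen_B = nn.Parameter(generalized[1])  # (rank_g, d_out)
            self.spec_A = nn.ParameterList(nn.Parameter(A) for A, _ in specialized)
            self.spec_B = nn.ParameterList(nn.Parameter(B) for _, B in specialized)
            self.gate = nn.Linear(d_in, len(specialized))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Generalized (shared) expert: always active.
            y = x @ self.gen_A @ self.gen_B
            # Specialized (routed) experts: weighted by the learned gate.
            w = F.softmax(self.gate(x), dim=-1)
            for i, (A, B) in enumerate(zip(self.spec_A, self.spec_B)):
                y = y + w[..., i:i + 1] * (x @ A @ B)
            return y

Keeping the shared expert always active is what would let the layer retain general pre-trained behavior while the routed experts absorb task-specific adaptation.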
