ArXiv TLDR

CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

2605.10903

Wenxuan Song, Han Zhao, Fuhao Li, Ziyang Zhou, Xi Wang + 5 more

cs.CV · cs.RO

TLDR

CapVector learns transferable capability vectors in parametric space for VLA models, enhancing performance and reducing adaptation costs during finetuning.

Key contributions

  • Decouples SFT objectives to learn transferable "capability vectors" from parameter differences.
  • Merges learned capability vectors with pretrained parameters to form a capability-enhanced meta model.
  • Achieves performance comparable to auxiliary-objective finetuning with reduced computational overhead.
  • Capability vectors are effective, versatile, and generalize to novel environments and embodiments out of the box.

Why it matters

Pretrained VLA models struggle with efficient finetuning. This paper offers a novel, computationally lightweight method to enhance model capabilities and reduce adaptation costs. It enables more effective and versatile VLA models, broadening their applicability across diverse tasks and environments.

Original Abstract

This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised finetuning (SFT). Some advanced finetuning methods with auxiliary training objectives can improve performance and reduce the number of convergence steps. However, they typically incur significant computational overhead due to the additional losses from auxiliary objectives. To simultaneously achieve the enhanced capabilities of auxiliary training with the simplicity of standard SFT, we decouple the two objectives of auxiliary-objective SFT within the parameter space, namely, enhancing general capabilities and fitting task-specific action distributions. To achieve this, we only need to train the model to convergence on a small-scale task set using two distinct training strategies, resulting in two finetuned models. The parameter difference between the two models can then be interpreted as capability vectors provided by auxiliary objectives. These vectors are then merged with pretrained parameters to form a capability-enhanced meta model. Moreover, when standard SFT is augmented with a lightweight orthogonal regularization loss, the merged model attains performance comparable to auxiliary-finetuned baselines with reduced computational overhead. Internal and external experiments demonstrate that our capability vectors (1) are effective and versatile across diverse models, and (2) can generalize to novel environments and embodiments out of the box.
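The core mechanics described in the abstract — subtracting two finetuned parameter sets to obtain a capability vector, then adding that vector to the pretrained parameters — can be sketched as simple parameter arithmetic. This is a minimal illustration of the idea under stated assumptions: the function names, the per-layer dictionary representation, and the scaling factor `alpha` are assumptions for exposition, not the paper's actual implementation or API.

```python
import numpy as np

def capability_vector(aux_params, sft_params):
    """Per-layer difference between the auxiliary-objective-finetuned
    model and the standard-SFT model (both trained on the same small task set)."""
    return {k: aux_params[k] - sft_params[k] for k in aux_params}

def merge(pretrained_params, cap_vec, alpha=1.0):
    """Add the (optionally scaled) capability vector to the pretrained
    parameters to form a capability-enhanced meta model."""
    return {k: pretrained_params[k] + alpha * cap_vec[k] for k in pretrained_params}

# Toy single-layer parameters standing in for full VLA model weights.
pretrained = {"w": np.zeros(4)}
sft_model  = {"w": np.ones(4)}        # converged via standard SFT
aux_model  = {"w": np.full(4, 1.5)}   # converged via auxiliary-objective SFT

vec  = capability_vector(aux_model, sft_model)  # isolates the auxiliary objectives' contribution
meta = merge(pretrained, vec)                   # capability-enhanced meta model
print(meta["w"])  # [0.5 0.5 0.5 0.5]
```

In this toy setup the task-specific fit shared by both finetuned models cancels in the subtraction, leaving only the capability added by the auxiliary objectives — which is what makes the vector transferable to new tasks, environments, and embodiments.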
