ArXiv TLDR

CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

2604.24622

Fan Du, Feng Yan, Jianxiong Wu, Xinrun Xu, Weiye Zhang + 4 more

cs.CV cs.AI

TLDR

CF-VLA introduces a coarse-to-fine two-stage approach for efficient vision-language-action generation, significantly improving real-time robot performance.

Key contributions

  • Proposes CF-VLA, a coarse-to-fine two-stage approach for efficient VLA action generation.
  • Coarse stage initializes actions from noise; fine stage refines them in a single step.
  • Reduces action sampling latency by 75.4% and achieves 83.0% real-robot success rate.
  • Outperforms existing NFE=2 methods and matches or surpasses the NFE=10 $π_{0.5}$ baseline on several metrics at a fraction of the inference cost.

Why it matters

Flow-based VLA policies are expressive but inefficient due to multi-step inference. CF-VLA addresses this by restructuring action generation, enabling faster and more reliable robot control. This is crucial for real-time applications where efficiency and performance are paramount.

Original Abstract

Flow-based vision-language-action (VLA) policies offer strong expressivity for action generation, but suffer from a fundamental inefficiency: multi-step inference is required to recover action structure from uninformative Gaussian noise, leading to a poor efficiency-quality trade-off under real-time constraints. We address this issue by rethinking the role of the starting point in generative action modeling. Instead of shortening the sampling trajectory, we propose CF-VLA, a coarse-to-fine two-stage formulation that restructures action generation into a coarse initialization step that constructs an action-aware starting point, followed by a single-step local refinement that corrects residual errors. Concretely, the coarse stage learns a conditional posterior over endpoint velocity to transform Gaussian noise into a structured initialization, while the fine stage performs a fixed-time refinement from this initialization. To stabilize training, we introduce a stepwise strategy that first learns a controlled coarse predictor and then performs joint optimization. Experiments on CALVIN and LIBERO show that our method establishes a strong efficiency-performance frontier under low-NFE (Number of Function Evaluations) regimes: it consistently outperforms existing NFE=2 methods, matches or surpasses the NFE=10 $π_{0.5}$ baseline on several metrics, reduces action sampling latency by 75.4%, and achieves the best average real-robot success rate of 83.0%, outperforming MIP by 19.5 points and $π_{0.5}$ by 4.0 points. These results suggest that structured, coarse-to-fine generation enables both strong performance and efficient inference. Our code is available at https://github.com/EmbodiedAI-RoboTron/CF-VLA.
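To make the two-stage idea concrete, here is a minimal, hypothetical sketch of an NFE=2 sampler in the spirit the abstract describes: one call to a coarse model that turns Gaussian noise into an action-aware initialization via a predicted endpoint velocity, followed by one fixed-time refinement call. The functions `coarse_velocity` and `fine_refine` are stand-ins for learned networks (all names, shapes, and the toy linear dynamics are our own assumptions, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
ACTION_DIM = 7  # hypothetical action dimensionality (e.g., a 7-DoF arm)


def coarse_velocity(noise, obs):
    """Stand-in for the learned conditional posterior over endpoint
    velocity: a vector that transports noise toward a plausible action.
    Here a toy closed form; in the paper this is a conditioned network."""
    return obs["target"] - noise


def fine_refine(x_init, obs):
    """Stand-in for the single-step, fixed-time local refinement that
    corrects residual error around the structured initialization."""
    return x_init + 0.1 * (obs["target"] - x_init)


def sample_action(obs):
    """NFE = 2: one coarse evaluation + one fine evaluation."""
    noise = rng.standard_normal(ACTION_DIM)       # uninformative Gaussian start
    x_init = noise + coarse_velocity(noise, obs)  # action-aware initialization
    return fine_refine(x_init, obs)               # single refinement step


# Toy observation: pretend the "correct" action is all ones.
obs = {"target": np.ones(ACTION_DIM)}
action = sample_action(obs)
```

The point of the sketch is the call structure, not the math: a standard flow-matching sampler would integrate many velocity-field evaluations from raw noise, whereas here exactly two function evaluations produce the action.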
