From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges

April 23, 2026 | arXiv:2604.21391

Yiming Zhong, Yaoyu He, Zemin Yang, Pengfei Tian, Yifan Huang + 3 more

cs.RO, cs.AI

TLDR

ResVLA refines robot control by decoupling global intent from local dynamics using a residual diffusion bridge, improving efficiency and robustness.

Key contributions

ResVLA shifts from "Generation-from-Noise" to "Refinement-from-Intent" for VLA policies.
Decomposes robot control into a low-frequency intent anchor and high-frequency residual dynamics.
Uses spectral analysis to decouple control and a residual diffusion bridge for refining local dynamics.
Demonstrates competitive performance, faster convergence, and strong robustness in both simulation and real-world experiments.
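
To make the anchor/residual split above concrete, here is a minimal sketch of one way to separate an action chunk into a low-frequency part and a high-frequency part. It assumes a plain FFT low-pass filter; the summary does not say which spectral transform or cutoff ResVLA uses, so `split_anchor_residual`, `keep_bins`, and the toy trajectory are all hypothetical, and in the paper the anchor is predicted by the model rather than computed from ground-truth actions.

```python
import numpy as np


def split_anchor_residual(actions, keep_bins):
    """Split a (T, D) action chunk into a low-frequency anchor and a residual.

    Assumption: an FFT low-pass stands in for the paper's spectral analysis.
    `keep_bins` is a hypothetical cutoff (number of lowest frequency bins kept).
    """
    spectrum = np.fft.rfft(actions, axis=0)      # per-dimension spectrum over time
    low = spectrum.copy()
    low[keep_bins:] = 0.0                        # discard high-frequency bins
    anchor = np.fft.irfft(low, n=actions.shape[0], axis=0)
    residual = actions - anchor                  # everything the anchor misses
    return anchor, residual


# Toy 2-D trajectory: smooth circular motion plus small high-frequency jitter.
rng = np.random.default_rng(0)
T = 64
t = np.linspace(0.0, 1.0, T, endpoint=False)
smooth = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)
actions = smooth + 0.05 * rng.standard_normal((T, 2))

anchor, residual = split_anchor_residual(actions, keep_bins=4)
```

By construction `anchor + residual` reproduces `actions`, so a generative model only has to account for the (much smaller) residual once the anchor is known.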

Why it matters

ResVLA targets the gap between high-level cognition and low-level robot control. By refining a predicted intent instead of generating actions from noise, it offers a more efficient and robust approach to embodied intelligence, which could lead to more reliable and adaptable robotic systems.

Original Abstract

Bridging high-level semantic understanding with low-level physical control remains a persistent challenge in embodied intelligence, stemming from the fundamental spatiotemporal scale mismatch between cognition and action. Existing generative VLA policies typically adopt a "Generation-from-Noise" paradigm, which disregards this disparity, leading to representation inefficiency and weak condition alignment during optimization. In this work, we propose ResVLA, an architecture that shifts the paradigm to "Refinement-from-Intent." Recognizing that robotic motion naturally decomposes into global intent and local dynamics, ResVLA utilizes spectral analysis to decouple control into a deterministic low-frequency anchor and a stochastic high-frequency residual. By anchoring the generative process on the predicted intent, our model focuses strictly on refining local dynamics via a residual diffusion bridge. Extensive simulation experiments show that ResVLA achieves competitive performance, strong robustness to language and robot embodiment perturbations, and faster convergence than standard generative baselines. It also demonstrates strong performance in real-world robot experiments.
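
The abstract does not give the form of the residual diffusion bridge. As a rough illustration of what "anchoring the generative process on the predicted intent" could look like, the sketch below samples from a standard Brownian bridge pinned at the anchor at t = 0 and at the target action at t = 1. This is a generic construction and not necessarily the paper's formulation; `bridge_sample`, `sigma`, and the toy arrays are hypothetical.

```python
import numpy as np


def bridge_sample(anchor, target, t, sigma, rng):
    """Draw x_t from a Brownian bridge between anchor (t=0) and target (t=1).

    Assumption: a textbook Brownian bridge, used only to illustrate a process
    that starts from an intent anchor instead of pure Gaussian noise.
    """
    mean = (1.0 - t) * anchor + t * target       # linear path between endpoints
    std = sigma * np.sqrt(t * (1.0 - t))         # noise vanishes at both ends
    return mean + std * rng.standard_normal(anchor.shape)


rng = np.random.default_rng(0)
anchor = np.zeros((8, 2))                        # stand-in for a predicted intent
target = anchor + 0.1 * rng.standard_normal((8, 2))  # stand-in for true actions

x_start = bridge_sample(anchor, target, 0.0, sigma=0.5, rng=rng)
x_mid = bridge_sample(anchor, target, 0.5, sigma=0.5, rng=rng)
x_end = bridge_sample(anchor, target, 1.0, sigma=0.5, rng=rng)
```

Unlike a noise-to-data process, both endpoints here are meaningful action chunks, so intermediate states stay close to the intended motion.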
