ArXiv TLDR

VADF: Vision-Adaptive Diffusion Policy Framework for Efficient Robotic Manipulation

2604.15938

Xinglei Yu, Zhenyang Liu, Shufeng Nan, Simo Wu, Yanwei Fu

cs.RO

TLDR

VADF enhances robotic manipulation diffusion policies by using vision-driven adaptive training and inference to speed up convergence and improve success rates.

Key contributions

  • Employs an Adaptive Loss Network (ALN) for real-time sample difficulty quantification and weighted training.
  • Utilizes a Hierarchical Vision Task Segmenter (HVTS) to adaptively decompose tasks based on visual input.
  • Optimizes inference by assigning varying noise schedules to simple vs. complex subtasks, reducing overhead.
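The ALN-guided weighted sampling in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `difficulty_weighted_sampling` and the toy loss values are assumptions, and the real ALN is an MLP that predicts these per-sample losses online rather than receiving them as input.

```python
import numpy as np

def difficulty_weighted_sampling(predicted_losses, batch_size, rng=None):
    """Hypothetical sketch of ALN-style hard-negative mining:
    samples whose predicted denoising loss is high are drawn
    more often than under uniform sampling."""
    rng = rng if rng is not None else np.random.default_rng(0)
    losses = np.asarray(predicted_losses, dtype=float)
    probs = losses / losses.sum()  # sampling weight proportional to predicted difficulty
    return rng.choice(len(losses), size=batch_size, replace=True, p=probs)

# Toy example: the sample with predicted loss 8.0 dominates the batch.
idx = difficulty_weighted_sampling([1.0, 1.0, 2.0, 8.0], batch_size=1000)
hard_fraction = (idx == 3).mean()
```

In a real training loop one would refresh `predicted_losses` from the ALN each step, so the sampling distribution tracks which regions of the data are currently hard.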

Why it matters

Diffusion policies are key in robotics but suffer from slow training convergence and inference timeout failures. VADF provides a model-agnostic, dual-adaptive solution that significantly improves both training efficiency and early inference success, making diffusion policies more robust and practical for real-world robotic manipulation.

Original Abstract

Diffusion policies are becoming mainstream in robotic manipulation but suffer from hard negative class imbalance due to uniform sampling and lack of sample difficulty awareness, leading to slow training convergence and frequent inference timeout failures. We propose VADF (Vision-Adaptive Diffusion Policy Framework), a vision-driven dual-adaptive framework that significantly reduces convergence steps and achieves early success in inference, with model-agnostic design enabling seamless integration into any diffusion policy architecture. During training, we introduce Adaptive Loss Network (ALN), a lightweight MLP-based loss predictor that quantifies per-step sample difficulty in real time. Guided by hard negative mining, it performs weighted sampling to prioritize high-loss regions, enabling adaptive weight updates and faster convergence. In inference, we design the Hierarchical Vision Task Segmenter (HVTS), which decomposes high-level task instructions into multi-stage low-level sub-instructions based on visual input. It adaptively segments action sequences into simple and complex subtasks by assigning shorter noise schedules with longer direct execution sequences to simple actions, and longer noise steps with shorter execution sequences to complex ones, thereby dramatically reducing computational overhead and significantly improving the early success rate.
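The HVTS scheduling trade-off described in the abstract can be illustrated with a short sketch. The function name, the complexity labels, and all concrete step counts below are assumptions for illustration; the abstract states only the qualitative rule (simple subtasks get shorter noise schedules with longer direct execution sequences, complex subtasks the reverse).

```python
def assign_schedule(subtask_complexity, base_steps=100):
    """Hypothetical HVTS-style rule: simple subtasks use fewer
    denoising steps but a longer directly-executed action horizon;
    complex subtasks use the full schedule over a shorter horizon."""
    if subtask_complexity == "simple":
        return {"noise_steps": base_steps // 4, "exec_horizon": 16}
    return {"noise_steps": base_steps, "exec_horizon": 4}

simple = assign_schedule("simple")    # cheap denoising, long horizon
complex_ = assign_schedule("complex")  # full denoising, short horizon
```

Because most of a manipulation episode is typically made of simple motions, skewing compute toward the complex segments is what reduces overall inference overhead.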

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.