HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
Tianshuo Yang, Guanyu Chen, Yutian Chen, Zhixuan Liang, Yitian Liu, et al. (6 additional authors)
TLDR
HiVLA is a hierarchical robot manipulation system that decouples VLM planning from motor control, improving long-horizon and fine-grained tasks.
Key contributions
- Proposes HiVLA, a hierarchical system decoupling high-level VLM planning from low-level motor control for robotics.
- VLM planner generates structured plans with subtask instructions and precise visual groundings (bounding boxes).
- Introduces a Diffusion Transformer (DiT) action expert with cascaded cross-attention for robust low-level execution.
- Significantly outperforms end-to-end baselines in long-horizon skill composition and fine-grained object manipulation.
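The structured plan from the contributions above pairs each subtask instruction with a grounding box. A minimal sketch of how such a plan might be represented (all field names, coordinates, and the planner stub are illustrative assumptions, not taken from the paper):

```python
from dataclasses import dataclass

@dataclass
class SubtaskPlan:
    instruction: str                       # natural-language subtask, e.g. "pick up the mug"
    bbox_xyxy: tuple[int, int, int, int]   # target bounding box in pixel coords (x1, y1, x2, y2)

def mock_vlm_planner(task: str) -> list[SubtaskPlan]:
    # Stand-in for the VLM planner: decomposes a long-horizon task into
    # subtasks, each grounded by a box. Values here are invented for the demo.
    return [
        SubtaskPlan("pick up the mug", (120, 80, 200, 160)),
        SubtaskPlan("place the mug on the shelf", (300, 40, 420, 140)),
    ]

plans = mock_vlm_planner("put the mug on the shelf")
for step in plans:
    print(step.instruction, step.bbox_xyxy)
```

Each `SubtaskPlan` would then be handed to the low-level action expert, which crops the image around `bbox_xyxy` for object-centric conditioning.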
Why it matters
This paper addresses the trade-off between a VLM's semantic reasoning and precise motor control in robotic manipulation. By decoupling planning from execution, HiVLA preserves the VLM's zero-shot reasoning while enabling robust physical actions, which markedly improves performance on complex multi-step tasks and precise object handling.
Original Abstract
While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. At the high level, a VLM planner first performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce at the low level a flow-matching Diffusion Transformer (DiT) action expert equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
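The abstract describes cascaded cross-attention as sequentially fusing three context streams into the action tokens: global scene context, then the high-resolution object crop, then skill semantics. A toy single-head sketch of that cascade in NumPy (random placeholder weights and residual updates are assumptions; HiVLA's actual parameterization and head structure are not given here):

```python
import numpy as np

def cross_attention(queries, context, rng):
    # Single-head scaled dot-product cross-attention with random placeholder
    # projections, just to show the information flow.
    d = queries.shape[-1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = queries @ Wq, context @ Wk, context @ Wv
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def cascaded_cross_attention(action_tokens, global_ctx, crop_ctx, skill_ctx, seed=0):
    # Fuse the three streams in sequence, as the abstract orders them:
    # global context -> object-centric crop -> skill semantics.
    rng = np.random.default_rng(seed)
    x = action_tokens
    for ctx in (global_ctx, crop_ctx, skill_ctx):
        x = x + cross_attention(x, ctx, rng)  # residual update per stage
    return x

# Toy shapes: 8 noisy action tokens, feature dim 16.
d = 16
rng = np.random.default_rng(1)
fused = cascaded_cross_attention(
    rng.standard_normal((8, d)),    # action tokens (DiT input)
    rng.standard_normal((64, d)),   # global image tokens
    rng.standard_normal((32, d)),   # high-res crop tokens
    rng.standard_normal((4, d)),    # skill/instruction tokens
)
print(fused.shape)  # (8, 16)
```

The key property is the fixed fusion order: later stages refine an action representation already conditioned on coarser context, letting the DiT concentrate on execution rather than scene understanding.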