FSUNav: A Cerebrum-Cerebellum Architecture for Fast, Safe, and Universal Zero-Shot Goal-Oriented Navigation
Mingao Tan, Yiyang Li, Shanze Wang, Xinming Zhang, Wei Zhang
TLDR
FSUNav pairs a VLM-driven cerebrum with a DRL-based cerebellum for fast, safe, and universal zero-shot goal-oriented navigation across diverse robots.
Key contributions
- Cerebellum module provides a DRL-based local planner for universal, efficient, and safe navigation across heterogeneous robots (see the control-loop sketch after this list).
- Cerebrum module uses VLMs for zero-shot, open-vocabulary goal navigation without predefined IDs.
- Supports multimodal inputs (text, images) for enhanced generalization, real-time performance, and robustness.
- Achieves state-of-the-art results on object-goal, instance-image, and task navigation across the MP3D, HM3D, and OVON benchmarks.
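To make the two-module split concrete, below is a minimal Python sketch of the control loop. It is illustrative only: the paper does not release code, so `Cerebrum`, `Cerebellum`, the tick ratio, and all constants are assumptions. The cerebrum ticks slowly and proposes a waypoint (a real system would query a VLM there); the cerebellum ticks at high frequency and converts the waypoint plus proximity readings into velocity commands, standing in for the paper's DRL local planner.

```python
from dataclasses import dataclass
import math


@dataclass
class Observation:
    pose: tuple                # (x, y, heading) in (m, m, rad)
    nearest_obstacle_m: float  # range to the closest obstacle


class Cerebrum:
    """Low-frequency reasoner; a real system would query a VLM here."""

    def plan_waypoint(self, goal_text, obs):
        # Placeholder: head 1 m forward. A VLM would instead ground
        # `goal_text` in the camera image and emit a semantic waypoint.
        x, y, yaw = obs.pose
        return (x + math.cos(yaw), y + math.sin(yaw))


class Cerebellum:
    """High-frequency local planner standing in for the DRL policy."""

    def act(self, waypoint, obs):
        x, y, yaw = obs.pose
        err = math.atan2(waypoint[1] - y, waypoint[0] - x) - yaw
        err = math.atan2(math.sin(err), math.cos(err))  # wrap to [-pi, pi]
        # Cap speed near obstacles: a crude proxy for the learned
        # collision-avoidance behavior attributed to the DRL planner.
        linear = min(0.5, 0.5 * obs.nearest_obstacle_m)
        angular = max(-1.0, min(1.0, 2.0 * err))
        return linear, angular  # (m/s, rad/s) base command


def control_loop(goal_text, steps=300):
    cerebrum, cerebellum = Cerebrum(), Cerebellum()
    obs = Observation(pose=(0.0, 0.0, 0.0), nearest_obstacle_m=2.0)
    waypoint = cerebrum.plan_waypoint(goal_text, obs)
    for t in range(steps):
        if t % 30 == 0:  # cerebrum re-plans ~30x slower than the cerebellum acts
            waypoint = cerebrum.plan_waypoint(goal_text, obs)
        v, w = cerebellum.act(waypoint, obs)
        # ... send (v, w) to the robot base and refresh `obs` from sensors
```

One appeal of this decoupling is that the fast low-level planner can be retrained or swapped per robot body while the slow semantic layer stays the same, which is how a single architecture can serve humanoid, quadruped, and wheeled platforms.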
Why it matters
Current vision-language navigation methods struggle with heterogeneous-robot compatibility, navigation safety, and open-vocabulary tasks. FSUNav addresses these gaps by integrating VLMs into a cerebrum-cerebellum architecture, enabling universal, safe, zero-shot navigation across diverse robots and moving robotic autonomy closer to practical deployment.
Original Abstract
Current vision-language navigation methods face substantial bottlenecks regarding heterogeneous robot compatibility, real-time performance, and navigation safety. Furthermore, they struggle to support open-vocabulary semantic generalization and multimodal task inputs. To address these challenges, this paper proposes FSUNav: a Cerebrum-Cerebellum architecture for fast, safe, and universal zero-shot goal-oriented navigation, which innovatively integrates vision-language models (VLMs) with the proposed architecture. The cerebellum, a high-frequency end-to-end module, implements a universal local planner based on deep reinforcement learning, enabling unified navigation across heterogeneous platforms (e.g., humanoid, quadruped, wheeled robots) to improve navigation efficiency while significantly reducing collision risk. The cerebrum module constructs a three-layer reasoning model and leverages VLMs to build an end-to-end detection and verification mechanism, enabling zero-shot open-vocabulary goal navigation without predefined IDs and improving task success rates in both simulation and real-world environments. Additionally, the framework supports multimodal inputs (e.g., text, target descriptions, and images), further enhancing generalization, real-time performance, safety, and robustness. Experimental results on the MP3D, HM3D, and OVON benchmarks demonstrate that FSUNav achieves state-of-the-art performance on object, instance image, and task navigation, significantly outperforming existing methods. Real-world deployments on diverse robotic platforms further validate its robustness and practical applicability.
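The abstract's "end-to-end detection and verification mechanism" can be pictured as a two-pass VLM query: detect a candidate goal first, then ask a stricter confirmation question before declaring success. The sketch below is an assumed design for illustration only; `ask_vlm`, the prompts, and the yes/no protocol are not from the paper, and `ask_vlm` is a stand-in for whatever VLM call a deployment uses.

```python
from typing import Callable

# Assumed interface: (image_bytes, question) -> free-form text answer.
AskVLM = Callable[[bytes, str], str]


def goal_confirmed(image: bytes, goal: str, ask_vlm: AskVLM) -> bool:
    """Two-pass open-vocabulary goal check: detect, then verify.

    Working from a free-text `goal` description is what removes the
    need for a predefined object-ID vocabulary.
    """
    detect_q = f"Is there a {goal} clearly visible in this image? Answer yes or no."
    if not ask_vlm(image, detect_q).strip().lower().startswith("yes"):
        return False
    # A second, stricter query guards against first-pass false positives
    # (pictures, reflections, look-alike objects).
    verify_q = (
        f"Look again carefully. Is the {goal} actually present in the scene, "
        f"not a picture, reflection, or similar-looking object? Answer yes or no."
    )
    return ask_vlm(image, verify_q).strip().lower().startswith("yes")
```

Gating success on the second pass trades a little latency for fewer false declarations of arrival, consistent with the paper's emphasis on task success rates in both simulation and the real world.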