FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation

April 17, 2026 | arXiv:2604.16298

Dian Shao, Zhengzheng Xu, Peiyang Wang, Like Liu, Yule Wang + 2 more

cs.CV, cs.RO

TLDR

FineCog-Nav introduces a cognitive modular framework for zero-shot UAV navigation, outperforming baselines in complex environments.

Key contributions

Proposes FineCog-Nav, a top-down framework with fine-grained cognitive modules for UAV navigation.
Modules use moderate-sized foundation models with role-specific prompts for effective collaboration.
Introduces AerialVLN-Fine, a new benchmark for fine-grained evaluation of UAV navigation.
Outperforms zero-shot baselines in instruction adherence, long-horizon planning, and generalization.
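
As a rough illustration of the modular idea above (role-specific prompts plus structured inputs and outputs), here is a minimal sketch. It is not the authors' implementation: the module order, the prompt wording, the state keys, and the `echo_model` stub standing in for a foundation model are all assumptions made for this example. Only the seven module roles come from the paper's abstract.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a foundation-model call: prompt text in, text out.
ModelFn = Callable[[str], str]


@dataclass
class Module:
    name: str          # cognitive role (names taken from the abstract)
    role_prompt: str   # assumed one-line role-specific prompt
    reads: List[str]   # state keys this module consumes
    writes: str        # state key this module produces
    model: ModelFn

    def run(self, state: Dict[str, str]) -> None:
        # Structured input: only the declared keys are passed to the model.
        inputs = "\n".join(f"{key}: {state[key]}" for key in self.reads)
        state[self.writes] = self.model(f"{self.role_prompt}\n{inputs}")


def echo_model(prompt: str) -> str:
    # Stub model: returns the role prompt (first line) in angle brackets.
    return f"<{prompt.splitlines()[0]}>"


def build_pipeline(model: ModelFn) -> List[Module]:
    # Assumed wiring; the paper may connect the modules differently.
    specs = [
        ("language", "Split the instruction into sub-goals.", ["instruction"], "subgoals"),
        ("perception", "Describe the current view.", ["observation"], "scene"),
        ("attention", "Select landmarks relevant to the sub-goals.", ["subgoals", "scene"], "focus"),
        ("memory", "Summarize progress so far.", ["history", "focus"], "history"),
        ("imagination", "Predict what the endpoint should look like.", ["subgoals"], "expected_view"),
        ("reasoning", "Assess progress toward the sub-goal.", ["focus", "history", "expected_view"], "assessment"),
        ("decision", "Choose the next flight action.", ["assessment"], "action"),
    ]
    return [Module(n, p, r, w, model) for n, p, r, w in specs]


def step(pipeline: List[Module], instruction: str, observation: str) -> Dict[str, str]:
    # One navigation step: each module reads from and writes to a shared state.
    state = {"instruction": instruction, "observation": observation, "history": ""}
    for module in pipeline:
        module.run(state)
    return state
```

Calling `step(build_pipeline(echo_model), "fly to the red roof", "rooftops ahead")` fills the shared state one key at a time, so each module's output can be inspected on its own, which is the interpretability benefit the summary refers to.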

Why it matters

This paper addresses limitations in zero-shot UAV navigation by introducing a human-cognition-inspired modular framework. Its fine-grained approach improves interpretability and performance in complex, long-horizon tasks. The new benchmark also facilitates more detailed evaluation of future methods.

Original Abstract

UAV vision-language navigation (VLN) requires an agent to navigate complex 3D environments from an egocentric perspective while following ambiguous multi-step instructions over long horizons. Existing zero-shot methods remain limited, as they often rely on large base models, generic prompts, and loosely coordinated modules. In this work, we propose FineCog-Nav, a top-down framework inspired by human cognition that organizes navigation into fine-grained modules for language processing, perception, attention, memory, imagination, reasoning, and decision-making. Each module is driven by a moderate-sized foundation model with role-specific prompts and structured input-output protocols, enabling effective collaboration and improved interpretability. To support fine-grained evaluation, we construct AerialVLN-Fine, a curated benchmark of 300 trajectories derived from AerialVLN, with sentence-level instruction-trajectory alignment and refined instructions containing explicit visual endpoints and landmark references. Experiments show that FineCog-Nav consistently outperforms zero-shot baselines in instruction adherence, long-horizon planning, and generalization to unseen environments. These results suggest the effectiveness of fine-grained cognitive modularization for zero-shot aerial navigation. Project page: https://smartdianlab.github.io/projects-FineCogNav.
