ArXiv TLDR

Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap

2604.13654

Hanxuan Chen, Jie Zheng, Siqi Yang, Tianle Zeng, Siwei Feng + 7 more

cs.RO

TLDR

A comprehensive survey of UAV vision-and-language navigation covering its progress, key challenges, and a roadmap for future embodied AI research.

Key contributions

  • Defines UAV-VLN tasks and traces evolution from modular to foundation model-based systems.
  • Reviews key resources: simulators, datasets, and evaluation metrics for standardized research.
  • Analyzes challenges such as the sim-to-real gap, outdoor perception, linguistic ambiguity, and hardware limits.
  • Proposes a research roadmap focusing on multi-agent coordination and air-ground robotic collaboration.

Why it matters

This paper consolidates UAV vision-language navigation progress and challenges, guiding future research in embodied AI. It highlights critical barriers and emerging directions for real-world UAV deployment.

Original Abstract

Vision-and-Language Navigation for Unmanned Aerial Vehicles (UAV-VLN) represents a pivotal challenge in embodied artificial intelligence, focused on enabling UAVs to interpret high-level human commands and execute long-horizon tasks in complex 3D environments. This paper provides a comprehensive and structured survey of the field, from its formal task definition to the current state of the art. We establish a methodological taxonomy that charts the technological evolution from early modular and deep learning approaches to contemporary agentic systems driven by large foundation models, including Vision-Language Models (VLMs), Vision-Language-Action (VLA) models, and the emerging integration of generative world models with VLA architectures for physically-grounded reasoning. The survey systematically reviews the ecosystem of essential resources (simulators, datasets, and evaluation metrics) that facilitates standardized research. Furthermore, we conduct a critical analysis of the primary challenges impeding real-world deployment: the simulation-to-reality gap, robust perception in dynamic outdoor settings, reasoning with linguistic ambiguity, and the efficient deployment of large models on resource-constrained hardware. By synthesizing current benchmarks and limitations, this survey concludes by proposing a forward-looking research roadmap to guide future inquiry into key frontiers such as multi-agent swarm coordination and air-ground collaborative robotics.
