ArXiv TLDR

UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models

arXiv:2604.02241

Qiyao Zhang, Shuhua Zheng, Jianli Sun, Chengxiang Li, Xianke Wu + 4 more

cs.CV, cs.RO

TLDR

This paper introduces UAV-Track VLA, a Vision-Language-Action model for embodied aerial tracking, together with a new large-scale benchmark, achieving robust real-time performance in complex urban environments.

Key contributions

  • Presents a new large-scale benchmark and dataset for embodied aerial tracking with UAVs, covering over 890K frames, 176 tasks, and 85 diverse objects.
  • Proposes UAV-Track VLA, a Vision-Language-Action tracking model built on the π_0.5 architecture.
  • Features a temporal compression net for inter-frame dynamics and a parallel dual-branch decoder for fine-grained continuous actions (see the sketch after this list).
  • Achieves a 61.76% success rate on long-distance pedestrian tracking, robust zero-shot generalization, and 33.4% lower single-step inference latency than the original π_0.5.
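The dual-branch decoder can be pictured with the minimal sketch below: one shared multimodal feature stream feeds (a) a grounding head predicting a normalized target box as a spatial prior and (b) an action head emitting continuous UAV control. All module names, dimensions, and the action layout are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class DualBranchDecoder(nn.Module):
    """Toy two-branch head: shared multimodal features feed (a) a grounding
    head predicting a normalized target box and (b) an action head emitting
    continuous UAV control. Hypothetical sketch, not the paper's code."""

    def __init__(self, feat_dim: int = 512, action_dim: int = 4):
        super().__init__()
        # Branch (a): spatial-aware auxiliary grounding head -> target box.
        self.grounding_head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 4),  # (cx, cy, w, h), squashed to [0, 1] below
        )
        # Branch (b): action expert -> continuous control. In the paper this
        # branch is a flow-matching expert; a plain MLP stands in here.
        self.action_expert = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),  # e.g. (vx, vy, vz, yaw_rate)
        )

    def forward(self, fused_feat: torch.Tensor):
        box = torch.sigmoid(self.grounding_head(fused_feat))
        action = self.action_expert(fused_feat)
        return box, action

decoder = DualBranchDecoder()
box, action = decoder(torch.randn(1, 512))  # one fused multimodal feature
```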

Why it matters

This paper significantly advances embodied aerial tracking by providing both a crucial benchmark and an efficient VLA model. Its real-time capabilities and robust performance in complex urban settings pave the way for more autonomous and reliable UAV applications.

Original Abstract

Embodied visual tracking is crucial for Unmanned Aerial Vehicles (UAVs) executing complex real-world tasks. In dynamic urban scenarios with complex semantic requirements, Vision-Language-Action (VLA) models show great promise due to their cross-modal fusion and continuous action generation capabilities. To benchmark multimodal tracking in such environments, we construct a dedicated evaluation benchmark and a large-scale dataset encompassing over 890K frames, 176 tasks, and 85 diverse objects. Furthermore, to address temporal feature redundancy and the lack of spatial geometric priors in existing VLA models, we propose an improved VLA tracking model, UAV-Track VLA. Built upon the π_0.5 architecture, our model introduces a temporal compression net to efficiently capture inter-frame dynamics. Additionally, a parallel dual-branch decoder comprising a spatial-aware auxiliary grounding head and a flow matching action expert is designed to decouple cross-modal features and generate fine-grained continuous actions. Systematic experiments in the CARLA simulator validate the superior end-to-end performance of our method. Notably, in challenging long-distance pedestrian tracking tasks, UAV-Track VLA achieves a 61.76% success rate and 269.65 average tracking frames, significantly outperforming existing baselines. Furthermore, it demonstrates robust zero-shot generalization in unseen environments and reduces single-step inference latency by 33.4% (to 0.0571 s) compared to the original π_0.5, enabling highly efficient, real-time UAV control. Data samples and demonstration videos are available at: https://github.com/Hub-Tian/UAV-Track_VLA.
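The "flow matching action expert" in the abstract refers to generating continuous actions by integrating a learned velocity field from noise toward the data distribution. A generic, minimal flow-matching sampler might look like the following; the network, conditioning, action dimension, and step count are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Predicts d(action)/dt given the current action, time t, and a
    multimodal context vector. Hypothetical sketch, not the paper's code."""

    def __init__(self, action_dim: int = 4, ctx_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + 1 + ctx_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, a, t, ctx):
        t_feat = t.expand(a.shape[0], 1)  # broadcast scalar time to batch
        return self.net(torch.cat([a, t_feat, ctx], dim=-1))

@torch.no_grad()
def sample_action(v_field, ctx, action_dim=4, steps=10):
    """Euler-integrate the learned velocity field from t=0 to t=1,
    starting from Gaussian noise, to produce one continuous action."""
    a = torch.randn(ctx.shape[0], action_dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.tensor([[i * dt]])
        a = a + dt * v_field(a, t, ctx)  # one Euler step along the flow
    return a

v_field = VelocityField()
action = sample_action(v_field, torch.randn(1, 512))
```

Fewer integration steps trade accuracy for latency, which is consistent with the abstract's emphasis on reducing single-step inference time for real-time control.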
