ArXiv TLDR

Rule-based High-Level Coaching for Goal-Conditioned Reinforcement Learning in Search-and-Rescue UAV Missions Under Limited-Simulation Training

arXiv:2604.26833

Mahya Ramezani, Holger Voos

cs.RO · cs.AI · cs.LG

TLDR

A hierarchical framework for SAR UAVs combines rule-based high-level coaching with online goal-conditioned RL, improving safety and sample efficiency.

Key contributions

  • Proposes a hierarchical framework combining a rule-based high-level advisor with an online goal-conditioned RL controller.
  • The high-level advisor provides interpretable mission and safety guidance through recommended and avoided actions.
  • The low-level RL controller learns online using mode-aware prioritized replay augmented with rule-derived metadata.
  • Improves early safety and sample efficiency in SAR UAV missions, primarily by reducing collision terminations.
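The advisor/controller split above can be illustrated with a minimal sketch. All names, thresholds, and the blending scheme here are assumptions for illustration, not the paper's implementation: the rules emit recommended and avoided actions plus a regime-dependent arbitration weight, and action selection blends the RL controller's preferences with that guidance.

```python
# Hypothetical sketch of a rule-based high-level advisor with arbitration.
# Rule conditions, thresholds, and the blending formula are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Advice:
    recommended: set = field(default_factory=set)  # actions the rules endorse
    avoided: set = field(default_factory=set)      # actions the rules veto
    weight: float = 0.5                            # regime-dependent arbitration weight

def advise(state):
    """Deterministic rules compiled from a task specification (illustrative)."""
    advice = Advice()
    if state["battery"] < 0.2:                # low-battery regime: safety dominates
        advice.recommended.add("return_to_base")
        advice.weight = 0.9
    if state["obstacle_distance"] < 1.0:      # proximity rule: veto risky motion
        advice.avoided.add("move_forward")
    return advice

def arbitrate(q_values, advice):
    """Blend RL action values with rule guidance via the arbitration weight."""
    scores = {}
    for action, q in q_values.items():
        bonus = 1.0 if action in advice.recommended else 0.0
        penalty = -1e9 if action in advice.avoided else 0.0  # hard veto
        scores[action] = (1 - advice.weight) * q + advice.weight * bonus + penalty
    return max(scores, key=scores.get)
```

With this kind of arbitration, a low-battery state steers the agent toward returning to base even if the untrained controller's Q-values still favor another action, which is one plausible way early-deployment safety can improve before the policy has converged.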

Why it matters

This work addresses a key challenge in deploying learned controllers for real-world UAV missions such as search-and-rescue, where opportunities for simulation training are limited. By pairing interpretable rule-based guidance with adaptive reinforcement learning, it improves early safety and sample efficiency, which matters most in safety-critical deployments.

Original Abstract

This paper presents a hierarchical decision-making framework for unmanned aerial vehicle (UAV) missions motivated by search-and-rescue (SAR) scenarios under limited simulation training. The framework combines a fixed rule-based high-level advisor with an online goal-conditioned low-level reinforcement learning (RL) controller. To stress-test early adaptation, we also consider a strict no-pretraining deployment regime. The high-level advisor is defined offline from a structured task specification and compiled into deterministic rules. It provides interpretable mission- and safety-aware guidance through recommended actions, avoided actions, and regime-dependent arbitration weights. The low-level controller learns online from task-defined dense rewards and reuses experience through a mode-aware prioritized replay mechanism augmented with rule-derived metadata. We evaluate the framework on two tasks: battery-aware multi-goal delivery and moving-target delivery in obstacle-rich environments. Across both tasks, the proposed method improves early safety and sample efficiency primarily by reducing collision terminations, while preserving the ability to adapt online to scenario-specific dynamics.
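The abstract's "mode-aware prioritized replay mechanism augmented with rule-derived metadata" can be sketched as follows. This is a guess at one plausible realization, not the authors' code: transitions store the mission mode and any rule flags (e.g. near-collision, low-battery), flags boost sampling priority, and sampling is further biased toward the current mode.

```python
# Illustrative sketch of mode-aware prioritized replay with rule-derived metadata.
# The priority formula and the mode bonus factor are assumptions for illustration.
import random

class ModeAwarePrioritizedReplay:
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.buffer = []  # list of (priority, transition) pairs

    def add(self, transition, td_error, rule_flags):
        # Rule metadata boosts priority so safety-relevant events are replayed more.
        priority = abs(td_error) + 0.5 * len(rule_flags)
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)  # drop oldest (simple FIFO eviction for the sketch)
        self.buffer.append((priority, transition))

    def sample(self, batch_size, current_mode):
        # Mode-aware weighting: transitions from the current mission mode get a bonus.
        weights = [p * (2.0 if t["mode"] == current_mode else 1.0)
                   for p, t in self.buffer]
        idx = random.choices(range(len(self.buffer)), weights=weights, k=batch_size)
        return [self.buffer[i][1] for i in idx]
```

Under these assumptions, experience gathered in a "return-to-base" regime is preferentially replayed while the agent is in that regime, which is one way a controller learning fully online could squeeze more out of scarce, scenario-specific data.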
