ArXiv TLDR

Cross-Modal Navigation with Multi-Agent Reinforcement Learning

arXiv: 2605.06595

Shuo Liu, Xinzichen Li, Christopher Amato

cs.RO · cs.AI · cs.LG · cs.MA

TLDR

CRONA is a Multi-Agent Reinforcement Learning framework for cross-modal navigation, improving collaboration via auxiliary beliefs and a centralized critic.

Key contributions

  • Introduces CRONA, a MARL framework for robust cross-modal navigation.
  • Enhances agent collaboration via auxiliary beliefs and a centralized multi-modal critic.
  • Shows that multi-agent approaches significantly boost performance and efficiency over single-agent baselines.
  • Analyzes homogeneous vs. heterogeneous collaboration for different navigation scenarios.

Why it matters

Embodied navigation depends on diverse sensory cues, but training a monolithic multi-modal model is challenging: rich inputs induce complex representations and substantially enlarge the policy space. CRONA instead coordinates lightweight modality-specialized agents, offering a scalable alternative that significantly improves performance and efficiency and advances practical embodied AI for complex navigation tasks.

Original Abstract

Robust embodied navigation relies on complementary sensory cues. However, high-quality and well-aligned multi-modal data is often difficult to obtain in practice. Training a monolithic model is also challenging as rich multi-modal inputs induce complex representations and substantially enlarge the policy space. Cross-modal collaboration among lightweight modality-specialized agents offers a scalable paradigm. It enables flexible deployment and parallel execution, while preserving the strength of each modality. In this paper, we propose CRONA, a Multi-Agent Reinforcement Learning (MARL) framework for Cross-Modal Navigation. CRONA improves collaboration by leveraging control-relevant auxiliary beliefs and a centralized multi-modal critic with global state. Experiments on visual-acoustic navigation tasks show that multi-agent methods significantly improve performance and efficiency over single-agent baselines. We find that homogeneous collaboration with limited modalities is sufficient for short-range navigation under salient cues; heterogeneous collaboration among agents with complementary modalities is generally efficient and effective; and navigation in large, complex environments requires both richer multi-modal perception and increased model capacity.
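
The architecture the abstract describes — decentralized, modality-specialized actors plus one centralized critic that sees the global state and all agents' observations — follows the familiar centralized-training, decentralized-execution (CTDE) pattern. A minimal sketch of that pattern is below; this is a generic illustration and not CRONA's actual code: the class names, dimensions, and linear parameterization are all assumptions, and the paper's auxiliary-belief heads are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

class ModalityActor:
    """Lightweight policy for a single modality (e.g. vision or audio).
    At execution time it only sees its own observation."""
    def __init__(self, obs_dim, n_actions):
        self.W = rng.normal(scale=0.1, size=(n_actions, obs_dim))

    def act_probs(self, obs):
        # Linear policy head -> action distribution over n_actions
        return softmax(self.W @ obs)

class CentralizedCritic:
    """Training-time value estimate conditioned on the global state
    plus every agent's (multi-modal) observation."""
    def __init__(self, state_dim, obs_dims):
        self.w = rng.normal(scale=0.1, size=state_dim + sum(obs_dims))

    def value(self, global_state, obs_list):
        joint = np.concatenate([global_state, *obs_list])
        return float(self.w @ joint)

# Toy step: a vision agent (8-dim obs) and an audio agent (4-dim obs)
vision = ModalityActor(obs_dim=8, n_actions=4)
audio = ModalityActor(obs_dim=4, n_actions=4)
critic = CentralizedCritic(state_dim=6, obs_dims=[8, 4])

obs_v, obs_a = rng.normal(size=8), rng.normal(size=4)
state = rng.normal(size=6)

probs_v = vision.act_probs(obs_v)   # each actor acts on its own modality
probs_a = audio.act_probs(obs_a)
v = critic.value(state, [obs_v, obs_a])  # critic used only during training
```

At deployment only the actors run, each on its own sensor stream, which is what makes the multi-agent decomposition cheaper than a single monolithic multi-modal policy.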
