Beyond Specialization: Robust Reinforcement Learning Navigation via Procedural Map Generators
Christian Jestel, Nicolas Bach, Marvin Wiedemann, Jan Finke, Peter Detzner
TLDR
Procedural map generation combined with A* subgoal inputs yields robust DRL navigation policies that generalize across environments and outperform classical methods, especially at higher speeds.
Key contributions
- Systematically compares four procedural map generators (sparse, maze, graph, and Wave Function Collapse) for their effect on DRL navigation policy generalization.
- Training on the combined generator set achieves 91.5 ± 1.1% mean success, overcoming specialist overfitting (a sparse-trained specialist drops to 3.3% success on mazes).
- A* path-planner subgoal inputs are the dominant factor for robustness, raising success from the 90.2 ± 1.4% feedforward baseline to 98.9 ± 0.4% (see the sketch after this list).
- DRL policies outperform a classical Carrot+A* controller, demonstrating superior speed adaptation and successful sim-to-real transfer.
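To make the subgoal mechanism concrete, here is a minimal sketch (not the paper's code) of the usual pattern: plan with A* on an occupancy grid, then feed the policy the waypoint a fixed lookahead ahead of the robot, expressed relative to its position. The grid representation, 4-connectivity, lookahead distance, and function names are all illustrative assumptions.

```python
# Illustrative sketch only -- not the paper's implementation.
# Assumes a binary occupancy grid (0 = free, 1 = obstacle), 4-connectivity,
# and unit step costs; `next_subgoal` and the 5-cell lookahead are
# hypothetical choices.
import heapq
from itertools import count

def astar(grid, start, goal):
    """Shortest 4-connected grid path via A* with a Manhattan heuristic."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    tie = count()                      # tiebreaker so the heap never compares cells
    open_set = [(h(start), next(tie), start, None)]
    came_from = {}                     # settled cell -> parent on the best path
    g_cost = {start: 0}
    while open_set:
        _, _, cur, parent = heapq.heappop(open_set)
        if cur in came_from:           # lazy deletion: already settled
            continue
        came_from[cur] = parent
        if cur == goal:                # walk parents back to the start
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and grid[nxt[0]][nxt[1]] == 0
                    and g_cost[cur] + 1 < g_cost.get(nxt, float("inf"))):
                g_cost[nxt] = g_cost[cur] + 1
                heapq.heappush(open_set, (g_cost[nxt] + h(nxt), next(tie), nxt, cur))
    return None                        # goal unreachable

def next_subgoal(path, robot_cell, lookahead=5):
    """Relative (dx, dy) of the waypoint `lookahead` steps past the robot's
    nearest path cell; this extra input is appended to the LiDAR observation."""
    nearest = min(range(len(path)),
                  key=lambda i: (path[i][0] - robot_cell[0]) ** 2
                              + (path[i][1] - robot_cell[1]) ** 2)
    sub = path[min(nearest + lookahead, len(path) - 1)]
    return (sub[0] - robot_cell[0], sub[1] - robot_cell[1])
```

The appeal of this design is that the planner supplies global guidance while the policy stays free to handle local obstacle avoidance and speed control reactively.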
Why it matters
This research addresses the tendency of DRL navigation policies to overfit their training environments, showing that diverse procedural map generation and A* subgoal inputs together produce robust, generalizable policies. These policies outperform classical methods, especially at higher speeds, offering a scalable path toward more reliable real-world robotic deployment.
Original Abstract
Deep reinforcement learning (DRL) navigation policies often overfit to the structure of their training environments, as environmental diversity is typically constrained by the manual effort required to design diverse scenarios. While procedural map generation offers scalable diversity, no prior work systematically compares how different generator types affect policy generalization. We integrate four generators (sparse, maze, graph, and Wave Function Collapse) with guaranteed navigability into MuRoSim, a 2D simulator focusing on training efficiency for LiDAR-based navigation. We cross-evaluate five navigation policies on 1000 seeded maps per generator across three training seeds. Results show a strongly asymmetric cross-generator transfer: a specialist trained on sparse layouts falls to 3.3% success on mazes, whereas a policy trained on the combined generator set achieves 91.5 ± 1.1% mean success. We further demonstrate that A* path-planner subgoal inputs are the dominant factor for robustness, raising success from the 90.2 ± 1.4% feedforward baseline to 98.9 ± 0.4% and outperforming GRU recurrence, which only improves the reactive baseline. The DRL policies outperform a classical Carrot+A* controller, which matches their success only at low speeds (1.0 m/s) but collapses to 24.9% at 2.0 m/s. This highlights learned speed adaptation as the decisive advantage of the learned approach. Real-world experiments on a RoboMaster confirm sim-to-real transfer in a cluttered arena, while a maze-like layout exposes remaining failure modes that recurrence helps mitigate.
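The abstract's "guaranteed navigability" and "1000 seeded maps per generator" suggest a simple rejection-sampling recipe: generate from a seed, verify the map is solvable, and only then admit it. The sketch below is an assumption about how such a generator could work, not MuRoSim's actual code; `sparse_map`, the grid size, and the obstacle density are hypothetical choices.

```python
# Illustrative sketch only -- not MuRoSim's generators.
# A seeded "sparse" map: scatter obstacles, then keep only maps where
# BFS confirms the goal is reachable from the start (navigability guarantee).
import random
from collections import deque

def bfs_reachable(grid, start, goal):
    """True if goal is reachable from start through free (0) cells."""
    rows, cols = len(grid), len(grid[0])
    seen, queue = {start}, deque([start])
    while queue:
        r, c = queue.popleft()
        if (r, c) == goal:
            return True
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return False

def sparse_map(seed, rows=32, cols=32, density=0.2, max_tries=100):
    """Seeded sparse generator: resample until start -> goal is navigable."""
    rng = random.Random(seed)  # per-map seed makes the benchmark reproducible
    start, goal = (0, 0), (rows - 1, cols - 1)
    for _ in range(max_tries):
        grid = [[1 if rng.random() < density else 0 for _ in range(cols)]
                for _ in range(rows)]
        grid[start[0]][start[1]] = grid[goal[0]][goal[1]] = 0
        if bfs_reachable(grid, start, goal):
            return grid
    raise RuntimeError("no navigable map found; lower the obstacle density")

# 1000 seeded maps, mirroring the paper's per-generator evaluation protocol:
maps = [sparse_map(seed) for seed in range(1000)]
```

The same generate-verify-admit loop extends to maze, graph, or Wave Function Collapse generators, which is what makes a seeded benchmark of 1000 maps per generator reproducible across training seeds.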