Multimodal embodiment-aware navigation transformer
Louis Dezons, Quentin Picard, Rémi Marsal, François Goulette, David Filliat
TLDR
ViLiNT is a multimodal, embodiment-aware transformer that uses diffusion models and collision prediction for robust, zero-shot robot navigation.
Key contributions
- Fuses RGB, LiDAR, goal, and robot embodiment via a transformer for rich environmental understanding.
- Conditions a diffusion model with transformer output to generate diverse, navigable robot trajectories.
- Employs an embodiment-aware path clearance head to score and rank candidate trajectories for collision avoidance (see the sketch after this list).
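To make these contributions concrete, here is a minimal PyTorch-style sketch of one way the per-modality tokens could be fused by a transformer encoder into a conditioning vector for the diffusion policy. All module names, feature dimensions, the one-token-per-modality layout, and the mean pooling are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FusionBackbone(nn.Module):
    """Hypothetical multimodal fusion: RGB, LiDAR, goal, and embodiment
    features are projected to tokens, contextualized by a transformer
    encoder, and pooled into a single conditioning vector."""

    def __init__(self, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        # Stand-in projections; real encoders would be CNN/point-cloud backbones.
        self.rgb_proj = nn.Linear(512, d_model)       # image features (assumed 512-d)
        self.lidar_proj = nn.Linear(256, d_model)     # point-cloud features (assumed 256-d)
        self.goal_proj = nn.Linear(2, d_model)        # e.g. relative (x, y) goal
        self.embodiment_proj = nn.Linear(3, d_model)  # e.g. (length, width, height)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, rgb_feat, lidar_feat, goal, embodiment):
        # One token per modality: (batch, 4, d_model)
        tokens = torch.stack([
            self.rgb_proj(rgb_feat),
            self.lidar_proj(lidar_feat),
            self.goal_proj(goal),
            self.embodiment_proj(embodiment),
        ], dim=1)
        fused = self.encoder(tokens)   # attention mixes geometry and appearance cues
        return fused.mean(dim=1)       # pooled vector conditioning the diffusion model
```

Because the embodiment descriptor enters as its own token, attention can modulate the fused representation by robot size, which is what lets a single model serve multiple platforms.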
Why it matters
ViLiNT substantially improves the robustness of robot navigation, especially collision avoidance, under varied conditions. By integrating multimodal sensing with embodiment awareness and a novel trajectory-ranking mechanism, it enables safer and more reliable zero-shot transfer for ground robots, a meaningful step toward practical autonomous navigation.
Original Abstract
Goal-conditioned navigation models for ground robots trained using supervised learning show promising zero-shot transfer, but their collision-avoidance capability nevertheless degrades under distribution shift, i.e., environmental, robot, or sensor configuration changes. We propose ViLiNT, a multimodal, attention-based policy for goal navigation, trained on heterogeneous data from multiple platforms and environments, which improves robustness with two key features. First, we fuse RGB images, 3D LiDAR point clouds, a goal embedding, and a robot's embodiment descriptor with a transformer architecture to capture complementary geometry and appearance cues. The transformer's output is used to condition a diffusion model that generates navigable trajectories. Second, using automatically generated offline labels, we train a path clearance prediction head for scoring and ranking trajectories produced by the diffusion model. Both the diffusion conditioning and the trajectory ranking head depend on a robot's embodiment token, which allows our model to generate and select trajectories with respect to the robot's dimensions. Across three simulated environments, ViLiNT improves Success Rate on average by 166% over an equivalent state-of-the-art vision-only baseline (NoMaD). This increase in performance is confirmed through real-world deployments of a rover navigating in obstacle fields. These results highlight that combining multimodal fusion with our collision prediction mechanism leads to improved off-road navigation robustness.
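As a rough illustration of the sample-then-rank inference the abstract describes, the sketch below samples candidate trajectories from the conditioned diffusion model and keeps the one the clearance head scores highest. The `diffusion_policy` and `clearance_head` callables and their signatures are hypothetical stand-ins, not the paper's API.

```python
import torch

@torch.no_grad()
def select_trajectory(diffusion_policy, clearance_head, cond, n_samples=16):
    """Hypothetical inference loop (assumed interfaces):
      diffusion_policy(cond, n_samples) -> (n_samples, horizon, 2) waypoints
      clearance_head(cond, trajs)       -> (n_samples,) clearance scores
    `cond` is the fused conditioning vector, which carries the embodiment
    token, so both generation and scoring see the robot's dimensions."""
    trajectories = diffusion_policy(cond, n_samples)  # diverse candidate paths
    scores = clearance_head(cond, trajectories)       # higher = more clearance
    best = scores.argmax()                            # pick the safest candidate
    return trajectories[best]
```

Ranking many diffusion samples rather than executing the first one is what converts the generator's diversity into a safety margin: only the candidate predicted to clear obstacles for this robot's footprint is executed.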