ArXiv TLDR

FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers

arXiv: 2605.03669

Timon Homberger, Finn Lukas Busch, Jesús Gerardo Ortega Peimbert, Quantao Yang, Olov Andersson

cs.RO, cs.AI

TLDR

FUS3DMaps introduces a dual-layer 3D semantic mapping method that fuses voxel and instance embeddings for scalable, accurate open-vocabulary scene understanding.

Key contributions

  • Develops FUS3DMaps, a novel dual-layer open-vocabulary semantic mapping method.
  • Fuses dense and instance-level embeddings within a shared voxel map for improved quality.
  • Achieves both scalability and accuracy by restricting dense-layer fusion to a spatial sliding window.
  • Enables accurate open-vocabulary semantic mapping for large, multi-story environments.

Why it matters

This paper addresses the scalability and accuracy limitations of existing open-vocabulary semantic mapping methods. By introducing a novel dual-layer fusion approach, FUS3DMaps enables robots to understand unseen concepts in large, complex environments. This advancement is crucial for developing more autonomous and adaptable robotic systems.

Original Abstract

Open-vocabulary semantic mapping enables robots to spatially ground previously unseen concepts without requiring predefined class sets. Current training-free methods commonly rely on multi-view fusion of semantic embeddings into a 3D map, either at the instance-level via segmenting views and encoding image crops of segments, or by projecting image patch embeddings directly into a dense semantic map. The latter approach sidesteps segmentation and 2D-to-3D instance association by operating on full uncropped image frames, but existing methods remain limited in scalability. We present FUS3DMaps, an online dual-layer semantic mapping method that jointly maintains both dense and instance-level open-vocabulary layers within a shared voxel map. This design enables further voxel-level semantic fusion of the layer embeddings, combining the complementary strengths of both semantic mapping approaches. We find that our proposed semantic cross-layer fusion approach improves the quality of both the instance-level and dense layers, while also enabling a scalable and highly accurate instance-level map where the dense layer and cross-layer fusion are restricted to a spatial sliding window. Experiments on established 3D semantic segmentation benchmarks as well as a selection of large-scale scenes show that FUS3DMaps achieves accurate open-vocabulary semantic mapping at multi-story building scales. Additional material and code will be made available: https://githanonymous.github.io/FUS3DMaps/.
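The core idea of the abstract (maintaining dense and instance-level embeddings in a shared voxel map, fusing them per voxel, and restricting the dense layer to a spatial window) can be sketched roughly as below. This is an illustrative toy, not the paper's implementation: the weighted-average fusion rule, the `w_dense` weight, and the `radius` parameter are all assumptions for demonstration.

```python
import numpy as np

def l2_normalize(v, eps=1e-8):
    """Normalize an embedding to unit length."""
    return v / (np.linalg.norm(v) + eps)

def fuse_layers(dense_emb, instance_emb, w_dense=0.5):
    """Toy cross-layer fusion: a normalized weighted average of a voxel's
    dense-layer and instance-layer embeddings (illustrative choice only;
    the paper's actual fusion rule may differ)."""
    fused = (w_dense * l2_normalize(dense_emb)
             + (1.0 - w_dense) * l2_normalize(instance_emb))
    return l2_normalize(fused)

def in_sliding_window(voxel_xyz, robot_xyz, radius=5.0):
    """Only voxels within a spatial window around the robot take part in
    dense-layer and cross-layer fusion (radius is a made-up parameter)."""
    return np.linalg.norm(np.asarray(voxel_xyz, dtype=float)
                          - np.asarray(robot_xyz, dtype=float)) <= radius

# Toy example: one voxel with a 512-d embedding in each layer.
rng = np.random.default_rng(0)
dense = rng.normal(size=512)      # stand-in for a projected patch embedding
inst = rng.normal(size=512)       # stand-in for an instance-crop embedding

if in_sliding_window((1.0, 2.0, 0.5), (0.0, 0.0, 0.0)):
    fused = fuse_layers(dense, inst)
    print(fused.shape)                      # (512,)
    print(round(float(np.linalg.norm(fused)), 3))  # 1.0 (unit norm)
```

A query would then be answered by cosine similarity between a text embedding and the fused per-voxel embeddings; since everything is unit-normalized here, that reduces to a dot product.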
