LEXI-SG: Monocular 3D Scene Graph Mapping with Room-Guided Feed-Forward Reconstruction
Christina Kassab, Hyeonjae Gil, Matías Mattamala, Ayoung Kim, Maurice Fallon
TLDR
LEXI-SG is the first dense monocular visual mapping system for open-vocabulary 3D scene graphs using only RGB camera input, enabling scalable dense reconstruction without depth sensors or LiDAR.
Key contributions
- First dense monocular visual mapping for open-vocabulary 3D scene graphs using only RGB.
- Exploits the semantic priors of open-vocabulary foundation models to partition the scene into rooms, deferring feed-forward reconstruction until each room is fully observed.
- Proposes a room-based factor graph that globally aligns room reconstructions while imposing the semantic scene graph hierarchy (see the sketch after this list).
- Supports open-vocabulary object segmentation and tracking within each room.
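
The factor graph idea can be pictured as a small pose graph over rooms rather than over every keyframe. Below is a minimal sketch using GTSAM in Python: each room reconstruction gets one SE(3) variable, consecutive rooms are linked by relative transforms (e.g. estimated from frames shared at a doorway), and a prior anchors the first room. The factor design and noise values here are illustrative assumptions, not LEXI-SG's actual formulation.

```python
# Minimal sketch of a room-level pose graph with GTSAM. The factor
# layout and noise sigmas are assumptions for illustration, not the
# paper's exact formulation.
import numpy as np
import gtsam

def align_rooms(relative_poses):
    """Globally align per-room reconstructions.

    relative_poses: list of gtsam.Pose3, the estimated SE(3) transform
    from room i to room i+1 (e.g. from frames observed at a doorway).
    Returns the optimized world pose of each room.
    """
    graph = gtsam.NonlinearFactorGraph()
    initial = gtsam.Values()

    # Anchor the first room at the origin to fix the gauge freedom.
    prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.full(6, 1e-6))
    graph.add(gtsam.PriorFactorPose3(gtsam.symbol('r', 0),
                                     gtsam.Pose3(), prior_noise))
    initial.insert(gtsam.symbol('r', 0), gtsam.Pose3())

    # One between-factor per room transition (rotation rad, translation m).
    between_noise = gtsam.noiseModel.Diagonal.Sigmas(
        np.array([0.05, 0.05, 0.05, 0.10, 0.10, 0.10]))
    pose = gtsam.Pose3()
    for i, rel in enumerate(relative_poses):
        graph.add(gtsam.BetweenFactorPose3(gtsam.symbol('r', i),
                                           gtsam.symbol('r', i + 1),
                                           rel, between_noise))
        pose = pose.compose(rel)  # chain transforms for the initial guess
        initial.insert(gtsam.symbol('r', i + 1), pose)

    result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
    return [result.atPose3(gtsam.symbol('r', i))
            for i in range(len(relative_poses) + 1)]
```

Optimizing over room poses instead of individual keyframes keeps the graph small, preserves each room's local map as a rigid unit, and directly mirrors the room level of the scene graph hierarchy.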
Why it matters
This paper introduces a new approach to 3D scene graph mapping that removes the reliance on depth sensors or LiDAR. By using only monocular RGB, it makes 3D scene understanding more accessible and scalable for robotics. Its room-guided strategy yields consistent, accurate, open-vocabulary scene representations.
Original Abstract
Scene graphs are becoming a standard representation for robot navigation, providing hierarchical geometric and semantic scene understanding. However, most scene graph mapping methods rely on depth cameras or LiDAR sensors. In this work, we present LEXI-SG, the first dense monocular visual mapping system for open-vocabulary 3D scene graphs using only RGB camera input. Our approach exploits the semantic priors of open-vocabulary foundation models to partition the scene into rooms, deferring feed-forward reconstruction until each room is fully observed -- enabling scalable dense mapping without sliding-window scale inconsistencies. We propose a room-based factor graph formulation to globally align room reconstructions while preserving local map consistency and naturally imposing the semantic scene graph hierarchy. Within each room, we further support open-vocabulary object segmentation and tracking. We validate LEXI-SG on indoor scenes from the Habitat-Matterport 3D dataset and on self-collected egocentric office sequences. We evaluate its performance against existing feed-forward SLAM methods, as well as established scene graph baselines. We demonstrate improved trajectory estimation and dense reconstruction, as well as competitive performance in open-vocabulary segmentation. LEXI-SG shows that accurate, scalable, open-vocabulary 3D scene graphs can be achieved from monocular RGB alone. Our project page and office sequences are available here: https://ori-drs.github.io/lexisg-web/.
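
To make the deferred-reconstruction idea concrete, here is a hypothetical sketch of the mapping loop the abstract describes: frames are buffered per room, and a single feed-forward reconstruction runs once the partitioner signals a room transition. All names (`partitioner`, `detect_room_transition`, `feed_forward_reconstruct`) are placeholders, not the system's real API.

```python
# Hypothetical sketch of the room-deferred mapping loop. Helper names
# are placeholders, not LEXI-SG's actual API.
from dataclasses import dataclass, field

@dataclass
class Room:
    frames: list = field(default_factory=list)   # buffered RGB frames
    points: object = None                        # dense reconstruction

def map_sequence(rgb_stream, partitioner, reconstructor):
    """Buffer frames per room; reconstruct once a room is fully observed."""
    rooms, current = [], Room()
    for frame in rgb_stream:
        current.frames.append(frame)
        # Open-vocabulary semantic priors flag doorways / room changes.
        if partitioner.detect_room_transition(frame):
            # Defer reconstruction until now: one feed-forward pass over
            # the whole room, instead of a sliding window of partial views.
            current.points = reconstructor.feed_forward_reconstruct(current.frames)
            rooms.append(current)
            current = Room()
    if current.frames:  # reconstruct the final, partially exited room
        current.points = reconstructor.feed_forward_reconstruct(current.frames)
        rooms.append(current)
    return rooms
```

Running one feed-forward pass per fully observed room means each room is reconstructed at a single consistent scale, so the sliding-window scale inconsistencies of monocular mapping are avoided and only the inter-room alignment is left to the factor graph.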