Flash-Mono: Feed-Forward Accelerated Gaussian Splatting Monocular SLAM

April 3, 2026 · arXiv:2604.03092

Zicheng Zhang, Ke Wu, Xiangting Meng, Keyu Liu, Jieru Zhao + 1 more

cs.RO

TLDR

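Flash-Mono is a feed-forward monocular Gaussian Splatting SLAM system. By predicting camera poses and Gaussian attributes directly instead of optimizing them per frame, it reports a 10x speedup over optimization-based GS-SLAM along with state-of-the-art tracking and mapping quality.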

Key contributions

Utilizes a recurrent feed-forward frontend to predict camera poses and per-pixel Gaussian properties from multi-frame features.
Achieves a 10x speedup over optimization-based GS-SLAM by directly predicting Gaussian attributes, bypassing time-consuming per-frame optimization.
Reuses the recurrent hidden states as compact submap descriptors for efficient loop closure and global Sim(3) optimization, which mitigates drift.
Enhances geometric fidelity by using 2D Gaussian surfels instead of conventional 3D Gaussian ellipsoids.
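
To make the Sim(3) part of the contributions concrete: a Sim(3) transform is a rotation, a translation, and a single scale factor, which is why it suits monocular SLAM, where scale can drift between submaps. The sketch below is not the paper's optimizer. It is the standard closed-form Umeyama alignment between two sets of corresponding 3D points, and the function name `estimate_sim3` and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def estimate_sim3(src, dst):
    """Least-squares similarity transform (Umeyama): dst ~ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding 3D points.
    Returns the scale s, rotation R (3x3), and translation t (3,).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centred point sets.
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)

    # Reflection guard so that R is a proper rotation (det = +1).
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


# Synthetic check: apply a known Sim(3) to random points and recover it.
rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 3))
angle = 0.3
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle), np.cos(angle), 0.0],
    [0.0, 0.0, 1.0],
])
s_true = 2.5
t_true = np.array([0.5, -1.0, 2.0])
moved = s_true * pts @ R_true.T + t_true

s, R, t = estimate_sim3(pts, moved)
```

In a full system, pairwise estimates like this would typically serve as constraints in a global optimization over all submaps rather than being applied one at a time.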

Why it matters

Monocular Gaussian Splatting SLAM is limited by slow train-from-scratch optimization and by the lack of inter-frame scale consistency when geometry priors come from single frames. Flash-Mono addresses both problems with a feed-forward paradigm that uses multi-frame context to predict Gaussian attributes directly. The authors argue that this makes the approach promising for embodied perception and real-time reconstruction.

Original Abstract

Monocular 3D Gaussian Splatting SLAM suffers from critical limitations in time efficiency, geometric accuracy, and multi-view consistency. These issues stem from the time-consuming $\textit{Train-from-Scratch}$ optimization and the lack of inter-frame scale consistency from single-frame geometry priors. We contend that a feed-forward paradigm, leveraging multi-frame context to predict Gaussian attributes directly, is crucial for addressing these challenges. We present Flash-Mono, a system composed of three core modules: a feed-forward prediction frontend, a 2D Gaussian Splatting mapping backend, and an efficient hidden-state-based loop closure module. We trained a recurrent feed-forward frontend model that progressively aggregates multi-frame visual features into a hidden state via cross attention and jointly predicts camera poses and per-pixel Gaussian properties. By directly predicting Gaussian attributes, our method bypasses the burdensome per-frame optimization required in optimization-based GS-SLAM, achieving a $\textbf{10x}$ speedup while ensuring high-quality rendering. The power of our recurrent architecture extends beyond efficient prediction. The hidden states act as compact submap descriptors, facilitating efficient loop closure and global $\mathrm{Sim}(3)$ optimization to mitigate the long-standing challenge of drift. For enhanced geometric fidelity, we replace conventional 3D Gaussian ellipsoids with 2D Gaussian surfels. Extensive experiments demonstrate that Flash-Mono achieves state-of-the-art performance in both tracking and mapping quality, highlighting its potential for embodied perception and real-time reconstruction applications. Project page: https://victkk.github.io/flash-mono.
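
The abstract says the hidden states act as compact submap descriptors for loop closure but does not describe how they are compared. As a rough sketch of one common way to use such descriptors, the code below mean-pools a hidden state into a unit vector and retrieves the most similar earlier submap by cosine similarity. Everything here is an assumption for illustration, not the paper's method: the `(tokens, channels)` hidden-state shape, the pooling, the cosine metric, the `min_gap` and `threshold` parameters, and the function names.

```python
import numpy as np


def pool_descriptor(hidden_state):
    """Mean-pool a (tokens, channels) hidden state into a unit-length vector.

    The shape and the pooling choice are assumptions, not taken from the paper.
    """
    v = np.asarray(hidden_state, dtype=float).mean(axis=0)
    return v / np.linalg.norm(v)


def find_loop_candidate(query, database, min_gap=1, threshold=0.9):
    """Return the index of the most similar earlier descriptor, or None.

    query: unit-length descriptor of the current submap.
    database: list of unit-length descriptors of earlier submaps, oldest first.
    min_gap: number of most recent submaps to skip (they overlap trivially).
    threshold: minimum cosine similarity to accept a loop candidate.
    """
    usable = len(database) - min_gap
    if usable <= 0:
        return None
    # Dot products of unit vectors are cosine similarities.
    sims = np.stack(database[:usable]) @ query
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else None


# Toy hidden states: 4 tokens with 4 channels each.
state_a = np.tile([1.0, 0.0, 0.0, 0.0], (4, 1))
state_b = np.tile([0.0, 1.0, 0.0, 0.0], (4, 1))
state_new = np.tile([0.0, 0.0, 1.0, 0.0], (4, 1))
revisit_a = state_a + 0.01  # a slightly perturbed revisit of submap A

database = [pool_descriptor(state_a), pool_descriptor(state_b), pool_descriptor(state_b)]
match = find_loop_candidate(pool_descriptor(revisit_a), database)
no_match = find_loop_candidate(pool_descriptor(state_new), database)
```

A retrieved candidate would still need geometric verification, for example a relative Sim(3) estimate between the two submaps, before being added as a loop constraint.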
