SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation
Chris Choy, Junha Lee, Chunghyun Park, Minsu Cho, Jan Kautz
TLDR
SpaCeFormer is a fast, proposal-free space-curve transformer for open-vocabulary 3D instance segmentation, achieving state-of-the-art zero-shot accuracy at 0.14 seconds per scene.
Key contributions
- Introduces SpaCeFormer, a proposal-free space-curve transformer for 3D instance segmentation.
- Runs at 0.14 seconds per scene, 2-3 orders of magnitude faster than prior multi-stage 2D+3D pipelines.
- Presents SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions), with 21x higher mask recall than prior single-view pipelines.
- Sets a new state of the art: 11.1 zero-shot mAP on ScanNet200 (2.8x over the prior best proposal-free method), 22.9 mAP on ScanNet++, and 24.1 mAP on Replica.
Why it matters
Open-vocabulary 3D instance segmentation has been held back by two bottlenecks: multi-stage 2D+3D pipelines that take hundreds of seconds per scene, and end-to-end methods trained on fragmented pseudo-labels. By removing both with a fast proposal-free architecture and a large multi-view-consistent dataset, this work makes open-vocabulary 3D scene understanding practical for real-world robotics and AR/VR applications.
Original Abstract
Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs at 0.14 seconds per scene, 2-3 orders of magnitude faster than multi-stage 2D+3D pipelines. We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21x higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU > 0.5). SpaCeFormer combines spatial window attention with Morton-curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8x improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.
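The abstract's central architectural idea is serializing points along a Morton (Z-order) space-filling curve so that window attention over the serialized sequence covers spatially coherent neighborhoods. The paper's code is not reproduced here, so the following NumPy sketch is only a minimal illustration of Morton serialization; the function names, the 10-bit grid resolution, and the normalization are our assumptions, not the authors' implementation.

```python
import numpy as np

def _part1by2(v):
    # Spread the low 10 bits of each value two positions apart:
    # ...b2 b1 b0 -> ...b2 0 0 b1 0 0 b0 (standard Morton bit trick).
    v = v.astype(np.int64) & 0x3FF
    v = (v | (v << 16)) & 0x030000FF
    v = (v | (v << 8)) & 0x0300F00F
    v = (v | (v << 4)) & 0x030C30C3
    v = (v | (v << 2)) & 0x09249249
    return v

def morton_order(xyz, grid=1024):
    # Quantize coordinates to a grid, interleave the bits of x, y, z
    # into one Morton key, and return the permutation that sorts the
    # point cloud along the resulting space-filling curve.
    lo, hi = xyz.min(0), xyz.max(0)
    q = ((xyz - lo) / (hi - lo + 1e-9) * (grid - 1)).astype(np.int64)
    return np.argsort(
        _part1by2(q[:, 0]) | (_part1by2(q[:, 1]) << 1) | (_part1by2(q[:, 2]) << 2)
    )

# Nearby indices in `order` are nearby in 3D, so fixed-size attention
# windows over the serialized sequence see spatially coherent regions.
points = np.random.rand(10_000, 3).astype(np.float32)
order = morton_order(points)
serialized = points[order]
```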
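Downstream of the encoder, the abstract states that instance masks are predicted directly from learned queries, with no external region proposals. The PyTorch sketch below shows that decoding step under common conventions from query-based segmenters (a dot-product mask head); the shapes and names are hypothetical, and the RoPE-enhanced cross-attention inside the decoder is omitted.

```python
import torch

# Hypothetical shapes (ours, not the paper's): N points, K instance
# queries, D-dimensional features.
N, K, D = 50_000, 100, 256
point_feats = torch.randn(N, D)  # per-point features from the space-curve encoder
queries = torch.randn(K, D)      # learned instance queries after decoding

# Each query scores every point; thresholding the sigmoid yields one
# binary instance mask per query, with no external region proposals.
mask_logits = queries @ point_feats.T   # (K, N)
masks = mask_logits.sigmoid() > 0.5     # (K, N) boolean instance masks
```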