TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens
Jiawei Ren, Michal Jan Tyszkiewicz, Jiahui Huang, Zan Gojcic
TLDR
TokenGS introduces an encoder-decoder with learnable tokens that directly predicts 3D Gaussian means, achieving state-of-the-art feed-forward 3DGS reconstruction.
Key contributions
- Proposes TokenGS, an encoder-decoder architecture with learnable tokens for 3D Gaussian Splatting.
- Directly regresses 3D Gaussian mean coordinates, decoupling prediction from pixel resolution.
- Achieves state-of-the-art feed-forward 3DGS reconstruction on static and dynamic scenes.
- Improves robustness to pose noise and multiview inconsistencies, enabling efficient token-space optimization.
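The core architectural idea above can be sketched in a few lines: a fixed set of learnable Gaussian tokens cross-attends to encoder image features, and per-token heads regress 3D means directly, so the primitive count is set by the number of tokens rather than by the input resolution. This is a minimal single-head, numpy-only illustration; all dimensions, weight shapes, and head choices are hypothetical and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes (illustrative, not from the paper)
num_tokens = 256       # number of Gaussian tokens -- independent of resolution
d_model = 64
num_pixels = 32 * 32   # flattened encoder feature map

# Learnable Gaussian tokens act as queries; encoder features as keys/values
tokens = rng.normal(size=(num_tokens, d_model))
feats = rng.normal(size=(num_pixels, d_model))

# One cross-attention step: every token gathers evidence from all pixels
attn = softmax(tokens @ feats.T / np.sqrt(d_model))  # (num_tokens, num_pixels)
updated = attn @ feats                               # (num_tokens, d_model)

# Per-token linear heads regress Gaussian parameters directly in 3D
# (a 3D mean, not a depth along a camera ray)
W_mean = rng.normal(size=(d_model, 3)) * 0.02
W_opac = rng.normal(size=(d_model, 1)) * 0.02
means = updated @ W_mean                             # (num_tokens, 3)
opacities = 1.0 / (1.0 + np.exp(-(updated @ W_opac)))  # sigmoid to (0, 1)
```

Because the output shape depends only on `num_tokens`, doubling the input resolution (here, `num_pixels`) changes the attention map but not the number of predicted Gaussians; test-time optimization can then update the token vectors alone while keeping the decoder weights fixed.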
Why it matters
TokenGS addresses a key limitation of current feed-forward 3DGS methods by decoupling the Gaussian primitive count from input resolution and the number of views. This yields more robust and efficient 3D scene reconstruction, setting a new benchmark for feed-forward methods, and additionally enables recovery of emergent scene attributes such as scene flow.
Original Abstract
In this work, we revisit several key design choices of modern Transformer-based approaches for feed-forward 3D Gaussian Splatting (3DGS) prediction. We argue that the common practice of regressing Gaussian means as depths along camera rays is suboptimal, and instead propose to directly regress 3D mean coordinates using only a self-supervised rendering loss. This formulation allows us to move from the standard encoder-only design to an encoder-decoder architecture with learnable Gaussian tokens, thereby unbinding the number of predicted primitives from input image resolution and number of views. Our resulting method, TokenGS, demonstrates improved robustness to pose noise and multiview inconsistencies, while naturally supporting efficient test-time optimization in token space without degrading learned priors. TokenGS achieves state-of-the-art feed-forward reconstruction performance on both static and dynamic scenes, producing more regularized geometry and more balanced 3DGS distribution, while seamlessly recovering emergent scene attributes such as static-dynamic decomposition and scene flow.