StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

April 16, 2026 · arXiv:2604.15237

Xuanyi Liu, Deyi Ji, Chunan Yu, Qi Zhu, Xuanfu Li + 3 more

cs.CV

TLDR

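StreamCacheVGGT is a training-free framework for dense 3D geometry reconstruction from video streams under a constant memory budget. It replaces pure token eviction with cross-layer importance scoring and a hybrid cache that merges moderately important tokens instead of deleting them.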

Key contributions

Replaces the "pure eviction" cache policy of existing O(1) streaming frameworks with a training-free scheme for 3D geometry reconstruction under a constant memory budget.
Introduces Cross-Layer Consistency-Enhanced Scoring (CLCES) for robust token importance tracking.
Employs Hybrid Cache Compression (HCC) with a three-tier triage strategy to preserve geometric context.
Reports state-of-the-art accuracy and long-term stability on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI).
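To make the cross-layer scoring idea concrete, here is a minimal sketch of aggregating per-layer token importance with an order statistic. This is not the paper's implementation: the function name `cross_layer_scores`, the list-of-lists input format, and the choice of the median as the order statistic are all assumptions made for illustration. The abstract only states that CLCES tracks importance across layers and uses order-statistical analysis.

```python
from statistics import median


def cross_layer_scores(per_layer_scores):
    """Aggregate per-layer token scores into one robust score per token.

    per_layer_scores: list of layers, each a list holding one importance
    score per cached token (hypothetical input format).
    """
    if not per_layer_scores:
        raise ValueError("need scores from at least one layer")
    num_tokens = len(per_layer_scores[0])
    if any(len(layer) != num_tokens for layer in per_layer_scores):
        raise ValueError("every layer must score the same tokens")
    # Assumption: the median stands in for the paper's order statistic.
    # A token must score well in most layers to get a high value, so a
    # spike in a single layer is discounted.
    return [
        median(layer[t] for layer in per_layer_scores)
        for t in range(num_tokens)
    ]


layers = [
    [0.9, 0.1, 0.5],  # layer 0
    [0.8, 0.9, 0.4],  # layer 1: token 1 spikes in this layer only
    [0.7, 0.1, 0.6],  # layer 2
]
robust = cross_layer_scores(layers)
```

In this toy input, token 1 looks important in one layer but not in the others, so single-layer scoring at layer 1 would rank it first while the aggregated score ranks it last.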

Why it matters

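Existing constant-memory methods for streaming 3D geometry reconstruction evict cached tokens outright, which destroys information, and they score tokens from a single layer, which makes the importance estimates noisy. StreamCacheVGGT addresses both problems without retraining: it scores tokens across layers and merges moderately important tokens into retained ones rather than deleting them. The authors report better accuracy and long-term stability as a result, which matters for real-time applications that need continuous 3D modeling.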

Original Abstract

Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing $O(1)$ frameworks primarily rely on a "pure eviction" paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC transcends simple eviction by introducing a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold. This approach preserves essential geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) demonstrate that StreamCacheVGGT sets a new state-of-the-art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.
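The three-tier triage described in the abstract can be sketched roughly as follows. This is an illustrative reading, not the authors' code: the function `hybrid_compress`, the parameters `n_keep` and `n_merge`, the use of squared Euclidean distance between key vectors, and merging by plain averaging are all assumptions. The abstract only specifies that moderately important tokens are merged into retained anchors via nearest-neighbor assignment on key vectors.

```python
def _sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def hybrid_compress(keys, values, scores, n_keep, n_merge):
    """Compress a KV cache to exactly n_keep entries.

    Tier 1: the n_keep highest-scoring tokens are retained as anchors.
    Tier 2: the next n_merge tokens are merged into their nearest anchor.
    Tier 3: all remaining tokens are evicted.
    """
    if not 1 <= n_keep <= len(scores):
        raise ValueError("n_keep must be between 1 and the cache size")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    anchors = sorted(order[:n_keep])  # keep anchors in original token order
    mergeable = order[n_keep:n_keep + n_merge]

    # Assign each mergeable token to the anchor with the closest key.
    groups = {a: [a] for a in anchors}
    for m in mergeable:
        nearest = min(anchors, key=lambda a: _sq_dist(keys[a], keys[m]))
        groups[nearest].append(m)

    # Assumption: a merged entry is the unweighted mean of its group.
    new_keys = [_mean_vec([keys[i] for i in groups[a]]) for a in anchors]
    new_values = [_mean_vec([values[i] for i in groups[a]]) for a in anchors]
    return new_keys, new_values


keys = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.2], [5.0, 5.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
scores = [0.9, 0.8, 0.5, 0.1]
new_keys, new_values = hybrid_compress(keys, values, scores, n_keep=2, n_merge=1)
```

In this toy cache, tokens 0 and 1 become anchors, token 2 is folded into anchor 0 because its key is closest to it, and token 3 is evicted. The output always has `n_keep` entries, which is what keeps the memory cost constant per step in this sketch.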
