ArXiv TLDR

StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

2604.15237

Xuanyi Liu, Deyi Ji, Chunan Yu, Qi Zhu, Xuanfu Li + 3 more

cs.CV

TLDR

StreamCacheVGGT improves streaming 3D geometry reconstruction from video under a constant memory budget by replacing pure cache eviction with cross-layer token scoring and hybrid cache compression.

Key contributions

  • Replaces the "pure eviction" cache paradigm with a training-free cache management framework for streaming 3D geometry reconstruction under constant memory.
  • Introduces Cross-Layer Consistency-Enhanced Scoring (CLCES) for robust token importance tracking.
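  • Employs Hybrid Cache Compression (HCC), a three-tier triage strategy that merges moderately important tokens into retained anchors instead of deleting them, preserving geometric context.
  • Reports state-of-the-art accuracy and long-term stability on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI).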
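
The cross-layer scoring idea behind CLCES can be illustrated with a small sketch. This is not the authors' implementation: the function name `cross_layer_score`, the `quantile` parameter, and the choice of a low quantile as the order statistic are all assumptions made for illustration. The summary only states that token importance is tracked across layers and analyzed with order statistics.

```python
from typing import List, Sequence


def cross_layer_score(
    layer_scores: Sequence[Sequence[float]], quantile: float = 0.25
) -> List[float]:
    """Aggregate per-layer token scores into one robust score per token.

    layer_scores[l][t] is the importance of token t at layer l.
    ASSUMPTION: a low quantile across layers is used as the order
    statistic, so a token scores high only if it is important in most
    layers, not just in one noisy layer.
    """
    if not layer_scores:
        raise ValueError("need at least one layer of scores")
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must lie in [0, 1]")
    num_layers = len(layer_scores)
    num_tokens = len(layer_scores[0])
    if any(len(row) != num_tokens for row in layer_scores):
        raise ValueError("every layer must score the same tokens")

    # Index of the order statistic within the sorted per-token trajectory.
    rank = int(quantile * (num_layers - 1))
    robust = []
    for t in range(num_tokens):
        trajectory = sorted(row[t] for row in layer_scores)
        robust.append(trajectory[rank])
    return robust


# Token 1 spikes to 0.9 in a single layer but is low everywhere else.
layers = [
    [0.9, 0.1, 0.5],
    [0.8, 0.9, 0.5],
    [0.7, 0.1, 0.6],
    [0.9, 0.2, 0.4],
]
print(cross_layer_score(layers))  # [0.7, 0.1, 0.4]
```

With four layers and `quantile=0.25`, the selected order statistic is the per-token minimum, so the single-layer spike on token 1 does not raise its score. A single-layer scorer reading only the second layer would have ranked that token highest.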

Why it matters

Existing constant-memory methods for streaming 3D geometry reconstruction delete cached tokens outright, which destroys information, and score token importance from a single layer, which makes the scores noisy. StreamCacheVGGT is a training-free framework that addresses both problems, and the authors report better reconstruction accuracy and long-term stability as a result. This matters for real-time applications that must reconstruct 3D geometry continuously without memory growing over time.

Original Abstract

Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing $O(1)$ frameworks primarily rely on a ``pure eviction'' paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC transcends simple eviction by introducing a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold. This approach preserves essential geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) demonstrate that StreamCacheVGGT sets a new state-of-the-art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.
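
The three-tier triage described in the abstract (retain, merge, evict) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function name `compress_cache`, the fixed tier sizes `n_keep` and `n_merge`, the squared Euclidean distance on keys, and merging by simple averaging are all hypothetical choices. The abstract specifies only that moderately important tokens are merged into retained anchors via nearest-neighbor assignment on the key vectors.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def _sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance (ASSUMED metric for nearest-neighbor)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def compress_cache(
    keys: Sequence[Vector],
    values: Sequence[Vector],
    scores: Sequence[float],
    n_keep: int,
    n_merge: int,
) -> Tuple[List[Vector], List[Vector]]:
    """Compress a key/value cache down to exactly n_keep entries.

    Tier 1: the n_keep highest-scoring tokens are retained as anchors.
    Tier 2: the next n_merge tokens are merged into their nearest anchor.
    Tier 3: all remaining tokens are evicted.
    """
    if not (len(keys) == len(values) == len(scores)):
        raise ValueError("keys, values and scores must have equal length")
    if n_keep < 1 or n_merge < 0 or n_keep > len(scores):
        raise ValueError("invalid tier sizes")

    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    anchors = order[:n_keep]
    mergers = order[n_keep:n_keep + n_merge]  # everything after is evicted

    key_sums = [list(keys[i]) for i in anchors]
    val_sums = [list(values[i]) for i in anchors]
    counts = [1] * len(anchors)

    for m in mergers:
        # Nearest anchor measured on the original (unmerged) anchor keys.
        nearest = min(
            range(len(anchors)),
            key=lambda a: _sq_dist(keys[m], keys[anchors[a]]),
        )
        for d, x in enumerate(keys[m]):
            key_sums[nearest][d] += x
        for d, x in enumerate(values[m]):
            val_sums[nearest][d] += x
        counts[nearest] += 1

    # ASSUMPTION: a merged entry is the plain mean of its members.
    new_keys = [[x / c for x in row] for row, c in zip(key_sums, counts)]
    new_values = [[x / c for x in row] for row, c in zip(val_sums, counts)]
    return new_keys, new_values


keys = [[0.0, 0.0], [10.0, 0.0], [1.0, 0.0], [9.0, 0.0], [5.0, 5.0]]
values = [[1.0], [2.0], [3.0], [4.0], [100.0]]
scores = [0.9, 0.8, 0.5, 0.4, 0.1]
new_keys, new_values = compress_cache(keys, values, scores, n_keep=2, n_merge=2)
print(new_keys)    # [[0.5, 0.0], [9.5, 0.0]]
print(new_values)  # [[2.0], [3.0]]
```

The output always has `n_keep` entries, which is what keeps memory constant across frames. Unlike pure eviction, the two mid-scoring tokens still influence the retained anchors, while the lowest-scoring token is dropped entirely.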
