Self-Improving 4D Perception via Self-Distillation
Nan Huang, Pengcheng Yu, Weijia Zeng, James M. Rehg, Angjoo Kanazawa, et al.
TLDR
SelfEvo is a self-improving framework that enhances pretrained 4D perception models via self-distillation on unlabeled videos, boosting video depth and camera estimation without any ground-truth annotations.
Key contributions
- Proposes SelfEvo, a self-improving framework for 4D perception using unlabeled videos.
- Introduces a self-distillation scheme leveraging spatiotemporal context asymmetry for self-improvement.
- Systematically studies design choices for effective self-improvement, including loss signals and asymmetry forms.
- Achieves up to 36.5% relative improvement in video depth and 20.1% in camera estimation without labeled data.
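The core idea, self-distillation with spatiotemporal context asymmetry, can be illustrated with a toy sketch: a "teacher" pass runs the model on a full frame window, a "student" pass sees a sparser window, and the student is trained to match the teacher's output as a pseudo-label, with no external annotations. This is a hypothetical illustration in NumPy, not the authors' implementation; `predict_depth`, the averaging-based context model, and all shapes are invented stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_depth(frames):
    """Stand-in for a 4D perception model producing per-frame depth.

    Mixes each frame with the mean of its context window, so the output
    genuinely depends on how much temporal context the model is given.
    """
    context = frames.mean(axis=0, keepdims=True)
    return 0.5 * frames + 0.5 * context

video = rng.random((8, 4, 4))        # 8 frames of 4x4 "depth cues"

# Teacher: full temporal context (all 8 frames).
teacher_out = predict_depth(video)

# Student: asymmetric, sparser context (every other frame).
student_idx = np.arange(0, 8, 2)
student_out = predict_depth(video[student_idx])

# Self-distillation signal: student matches teacher on shared frames.
# No labels are involved; the teacher's richer context supplies the target.
loss = float(np.mean((student_out - teacher_out[student_idx]) ** 2))
```

Because the teacher aggregates more frames, its context term differs from the student's, so the loss is nonzero and pushes the student toward context-consistent predictions; in the actual framework this gradient would update the pretrained model itself.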
Why it matters
Existing 4D perception models heavily rely on expensive ground-truth annotations, limiting their scalability, especially for dynamic scenes. SelfEvo addresses this by enabling continuous model improvement using only unlabeled videos. This significantly reduces annotation costs, making advanced 4D reconstruction more accessible and scalable for real-world applications.
Original Abstract
Large-scale multi-view reconstruction models have made remarkable progress, but most existing approaches still rely on fully supervised training with ground-truth 3D/4D annotations. Such annotations are expensive and particularly scarce for dynamic scenes, limiting scalability. We propose SelfEvo, a self-improving framework that continually improves pretrained multi-view reconstruction models using unlabeled videos. SelfEvo introduces a self-distillation scheme using spatiotemporal context asymmetry, enabling self-improvement for learning-based 4D perception without external annotations. We systematically study design choices that make self-improvement effective, including loss signals, forms of asymmetry, and other training strategies. Across eight benchmarks spanning diverse datasets and domains, SelfEvo consistently improves pretrained baselines and generalizes across base models (e.g. VGGT and $π^3$), with significant gains on dynamic scenes. Overall, SelfEvo achieves up to 36.5% relative improvement in video depth estimation and 20.1% in camera estimation, without using any labeled data. Project Page: https://self-evo.github.io/.