Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
Jerry Jiang, Haowen Sun, Denis Gudovskiy, Yohei Nakata, Tomoyuki Okuno, et al.
TLDR
Proxy3D introduces efficient 3D representations for Vision-Language Models by using semantic-aware clustering of scene features from video frames.
Key contributions
- Introduces Proxy3D, a method for efficient and compact 3D proxy representations for Vision-Language Models.
- Generates 3D proxies by semantically clustering scene features extracted from video frames via encoders.
- Achieves competitive or SOTA performance in 3D VQA, visual grounding, and spatial intelligence tasks.
- Curates the SpaceSpan dataset and employs multi-stage training for effective VLM representation alignment.
Why it matters
This paper addresses the critical need for efficient and consistent 3D reasoning in Vision-Language Models. Proxy3D's novel 3D proxy representations overcome limitations of prior 2D-centric methods, enhancing spatial intelligence and computational efficiency. This work advances VLMs towards more robust and practical understanding of the 3D world.
Original Abstract
Spatial intelligence in vision-language models (VLMs) attracts research interest with the practical demand to reason in the 3D world. Despite promising results, most existing methods follow the conventional 2D pipeline in VLMs and use pixel-aligned representations for the vision modality. However, correspondence-based models with implicit 3D scene understanding often fail to achieve spatial consistency, and representation-based models with 3D geometric priors lack efficiency in vision sequence serialization. To address this, we propose a Proxy3D method with compact yet comprehensive 3D proxy representations for the vision modality. Given only video frames as input, we employ semantic and geometric encoders to extract scene features and then perform their semantic-aware clustering to obtain a set of proxies in the 3D space. For representation alignment, we further curate the SpaceSpan dataset and apply multi-stage training to adopt the proposed 3D proxy representations with the VLM. When using shorter sequences for vision information, our method achieves competitive or state-of-the-art performance in 3D visual question answering, visual grounding and general spatial intelligence benchmarks.
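The core compression step described above — grouping many per-patch scene features into a small set of proxy tokens via semantic-aware clustering — can be sketched in miniature. This is only an illustrative assumption: plain k-means over feature vectors stands in for the paper's actual clustering procedure, and the encoders, feature dimensions, and proxy count shown here are invented for the example.

```python
import numpy as np

def cluster_features(features, num_proxies=8, iters=10, seed=0):
    """Toy stand-in for semantic-aware clustering: compress per-patch
    feature vectors into a small set of proxy centroids via k-means
    (an assumption; Proxy3D's exact clustering objective is not shown)."""
    rng = np.random.default_rng(seed)
    # initialize centroids from randomly chosen feature samples
    centroids = features[rng.choice(len(features), num_proxies, replace=False)]
    for _ in range(iters):
        # assign each feature to its nearest centroid (L2 distance)
        dists = np.linalg.norm(features[:, None] - centroids[None], axis=-1)
        labels = dists.argmin(axis=1)
        # update each centroid as the mean of its assigned features
        for k in range(num_proxies):
            if np.any(labels == k):
                centroids[k] = features[labels == k].mean(axis=0)
    return centroids, labels

# e.g. 1,024 patch features of dim 32 compressed to 8 proxy tokens,
# giving the VLM a much shorter vision sequence to serialize
feats = np.random.default_rng(1).normal(size=(1024, 32))
proxies, assignments = cluster_features(feats)
```

The point of the sketch is the efficiency argument from the abstract: the language model attends to a handful of proxy tokens rather than thousands of pixel-aligned ones.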