Steerable Visual Representations
Jona Ruthardt, Manu Gaur, Deva Ramanan, Makarand Tapaswi, Yuki M. Asano
TLDR
Introduces Steerable Visual Representations, which let natural language direct both global and local visual features in ViTs toward concepts of interest.
Key contributions
- Introduces Steerable Visual Representations guided by natural language.
- Uses early fusion: injects text directly into the layers of the visual encoder via lightweight cross-attention (see the sketch after this list).
- Enables focusing on specific objects while preserving representation quality.
- Matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, with zero-shot generalization to out-of-distribution tasks.
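The digest includes no code, so the following is a minimal PyTorch sketch of the early-fusion idea: a lightweight cross-attention adapter wrapped around a (possibly frozen) ViT block lets visual tokens attend to text-prompt embeddings. All names and dimensions here (`TextCrossAttention`, `SteerableBlock`, etc.) are hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): early fusion of a text prompt into
# a ViT via lightweight cross-attention.
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    """Lightweight cross-attention: visual tokens (queries) attend to text tokens."""
    def __init__(self, dim: int, text_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=dim, kdim=text_dim, vdim=text_dim,
            num_heads=num_heads, batch_first=True,
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, text_tokens):
        # Residual update, so the visual stream is only steered, not replaced.
        steered, _ = self.attn(
            query=self.norm(visual_tokens), key=text_tokens, value=text_tokens
        )
        return visual_tokens + steered

class SteerableBlock(nn.Module):
    """Wraps a pretrained ViT block and injects text after it (early fusion)."""
    def __init__(self, vit_block: nn.Module, dim: int, text_dim: int):
        super().__init__()
        self.vit_block = vit_block  # pretrained backbone block, can stay frozen
        self.cross_attn = TextCrossAttention(dim, text_dim)  # trainable adapter

    def forward(self, visual_tokens, text_tokens):
        visual_tokens = self.vit_block(visual_tokens)
        return self.cross_attn(visual_tokens, text_tokens)

# Usage sketch: nn.Identity() stands in for a real ViT block.
dim, text_dim = 768, 512
block = SteerableBlock(nn.Identity(), dim, text_dim)
visual = torch.randn(1, 197, dim)    # [batch, CLS + patch tokens, dim]
text = torch.randn(1, 16, text_dim)  # [batch, prompt tokens, text_dim]
out = block(visual, text)            # same shape as `visual`
```

Wrapping every block (or a subset) this way lets the text prompt shape intermediate visual features; this is the early fusion the abstract contrasts with CLIP-style late fusion, where image and text are encoded independently and combined only afterward.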
Why it matters
Pretrained ViTs are not steerable: they latch onto the most salient visual cues, with no way to direct them toward less prominent concepts. Multimodal LLMs can be prompted with text, but their representations become language-centric and lose effectiveness on generic visual tasks. This paper bridges that gap by steering visual representations directly with natural language, enabling more flexible and targeted use of visual features across diverse tasks.
Original Abstract
Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.