ArXiv TLDR

Indexing Multimodal Language Models for Large-scale Image Retrieval

2604.13268

Bahey Tharwat, Giorgos Kordopatis-Zilos, Pavel Suma, Ian Reid, Giorgos Tolias

cs.CV · cs.CL · cs.IR

TLDR

MLLMs are used as training-free similarity estimators for large-scale image retrieval, showing robust zero-shot performance.

Key contributions

  • Uses MLLMs as training-free similarity estimators for instance-level image-to-image retrieval.
  • Converts MLLM next-token probabilities from paired images into similarity scores for zero-shot re-ranking.
  • Achieves scalability by combining MLLMs with memory-efficient indexing and top-k candidate re-ranking.
  • Outperforms task-specific re-rankers and shows robustness to clutter, occlusion, and small objects.
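The second contribution, turning next-token probabilities into similarity scores, can be sketched as follows. Assuming the MLLM is prompted with two images and a binary same-instance question, the model's logits for an affirmative vs. negative answer token (names here are hypothetical; the paper's exact prompt and tokens may differ) yield a similarity via a two-way softmax:

```python
import numpy as np

def similarity_from_logits(yes_logit: float, no_logit: float) -> float:
    """Convert the MLLM's next-token logits for a binary same/different
    prompt into a similarity score in [0, 1] via a two-way softmax."""
    m = max(yes_logit, no_logit)          # subtract max for numerical stability
    e_yes = np.exp(yes_logit - m)
    e_no = np.exp(no_logit - m)
    return float(e_yes / (e_yes + e_no))
```

A score near 1 means the model strongly favors the affirmative token, i.e. the two images likely depict the same instance.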

Why it matters

This paper demonstrates MLLMs' untapped potential for vision-only tasks, specifically large-scale image retrieval, without requiring fine-tuning. It provides a robust, zero-shot method that outperforms specialized re-rankers, opening new avenues for open-world retrieval systems.

Original Abstract

Multimodal Large Language Models (MLLMs) have demonstrated strong cross-modal reasoning capabilities, yet their potential for vision-only tasks remains underexplored. We investigate MLLMs as training-free similarity estimators for instance-level image-to-image retrieval. Our approach prompts the model with paired images and converts next-token probabilities into similarity scores, enabling zero-shot re-ranking within large-scale retrieval pipelines. This design avoids specialized architectures and fine-tuning, leveraging the rich visual discrimination learned during multimodal pre-training. We address scalability by combining MLLMs with memory-efficient indexing and top-$k$ candidate re-ranking. Experiments across diverse benchmarks show that MLLMs outperform task-specific re-rankers outside their native domains and exhibit superior robustness to clutter, occlusion, and small objects. Despite strong results, we identify failure modes under severe appearance changes, highlighting opportunities for future research. Our findings position MLLMs as a promising alternative for open-world large-scale image retrieval.
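The two-stage pipeline described in the abstract — a memory-efficient index for coarse retrieval, followed by MLLM re-ranking of only the top-$k$ candidates — can be sketched as below. This is a minimal illustration, not the paper's implementation: the embedding search is plain cosine similarity over NumPy arrays, and `mllm_score` is a stand-in for the pairwise MLLM scorer.

```python
import numpy as np

def retrieve_and_rerank(query_vec, index_vecs, mllm_score, k=5):
    """Two-stage retrieval: a cheap global-embedding search selects the
    top-k candidates, then an expensive pairwise scorer (the MLLM in the
    paper) re-ranks only those k. Returns candidate indices, best first."""
    # Stage 1: cosine similarity against the whole (compact) index.
    q = query_vec / np.linalg.norm(query_vec)
    d = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    coarse = d @ q
    top_k = np.argsort(-coarse)[:k]
    # Stage 2: pairwise scoring restricted to the k shortlisted candidates,
    # so the costly model is called k times instead of once per index image.
    fine = np.array([mllm_score(query_vec, index_vecs[i]) for i in top_k])
    return top_k[np.argsort(-fine)]
```

Because the MLLM only ever sees $k$ pairs per query, the pipeline's cost scales with $k$ rather than with the index size, which is what makes large-scale use feasible.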
