ArXiv TLDR

BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment

arXiv: 2604.07201

Mohamed Darwish Mounis, Mohamed Mahmoud, Shaimaa Sedek, Mahmoud Abdalla, Mahmoud SalahEldin Kasem + 2 more

cs.IR, cs.CV

TLDR

BRIDGE improves multimodal-to-text retrieval by using a reinforcement-learned query aligner (FORGE) and a reasoning-enhanced retriever (LENS).

Key contributions

  • Introduces BRIDGE, a two-component system for multimodal-to-text retrieval without multimodal encoders.
  • FORGE, an RL-trained query aligner, distills noisy multimodal queries into compact search strings.
  • LENS, a reasoning-enhanced dense retriever, handles the intent-rich queries produced by FORGE.
  • BRIDGE achieves 29.7 nDCG@10 on MM-BRIGHT, surpassing all multimodal encoder baselines; with FORGE applied on top of Nomic-Vision, the combined system reaches 33.3, exceeding the best text-only retriever (32.2).
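All of the scores above are nDCG@10, the standard graded-relevance ranking metric. As a refresher, here is a minimal self-contained sketch of how it is computed (this is the textbook definition, not the paper's evaluation code):

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: each relevance grade is discounted
    # by the log of its rank position (rank 1 -> log2(2), rank 2 -> log2(3), ...).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalize DCG by the ideal (descending-sorted) ordering,
    # so a perfect ranking scores 1.0.
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

So a system scoring 33.3 nDCG@10 averages 0.333 of the ideal discounted gain over the top 10 results, taken across all queries.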

Why it matters

Multimodal-to-text retrieval struggles because raw image-text queries entangle visual descriptions, conversational noise, and retrieval intent. This paper demonstrates that query alignment, not the retriever itself, is the key bottleneck: distilling multimodal queries into compact search strings lifts nDCG@10 on MM-BRIGHT from 27.6 (best vision-language encoder) to 33.3, paving the way for more effective and accessible multimodal search.
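The align-then-retrieve pipeline described above can be sketched in miniature. Everything below is illustrative: `forge_align` stands in for the RL-trained FORGE model with a trivial stopword filter, and the bag-of-words `embed` stands in for LENS's learned dense embeddings; the actual system uses trained models at both stages.

```python
import math
from collections import Counter

def forge_align(multimodal_query: str) -> str:
    # Hypothetical stand-in for FORGE: the paper trains this step with RL.
    # Here we just drop filler words to mimic "distilling" a noisy query
    # into a compact search string.
    stopwords = {"please", "can", "you", "tell", "me", "about",
                 "the", "a", "in", "this", "image"}
    return " ".join(w for w in multimodal_query.lower().split()
                    if w not in stopwords)

def embed(text: str) -> Counter:
    # Toy stand-in for dense embeddings: a sparse bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    # Stage 1: align the query; stage 2: rank the corpus by similarity.
    q = embed(forge_align(query))
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]
```

The design point the paper makes is that stage 1 is where the gains live: the same retriever scores much higher once the query has been distilled.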

Original Abstract

Multimodal retrieval systems struggle to resolve image-text queries against text-only corpora: the best vision-language encoder achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming strong text-only retrievers. We argue the bottleneck is not the retriever but the query -- raw multimodal queries entangle visual descriptions, conversational noise, and retrieval intent in ways that systematically degrade embedding similarity. We present **BRIDGE**, a two-component system that resolves this mismatch without multimodal encoders. **FORGE** (**F**ocused Retrieval Query Generato**r**) is a query alignment model trained via reinforcement learning, which distills noisy multimodal queries into compact, retrieval-optimized search strings. **LENS** (**L**anguage-**E**nhanced **N**eural **S**earch) is a reasoning-enhanced dense retriever fine-tuned on reasoning-intensive retrieval data to handle the intent-rich queries FORGE produces. Evaluated on MM-BRIGHT (2,803 queries, 29 domains), BRIDGE achieves **29.7** nDCG@10, surpassing all multimodal encoder baselines including Nomic-Vision (27.6). When FORGE is applied as a plug-and-play aligner on top of Nomic-Vision, the combined system reaches **33.3** nDCG@10 -- exceeding the best text-only retriever (32.2) -- demonstrating that *query alignment* is the key bottleneck in multimodal-to-text retrieval. https://github.com/mm-bright/multimodal-reasoning-retrieval
