HIVE: Query, Hypothesize, Verify: An LLM Framework for Multimodal Reasoning-Intensive Retrieval
Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Mohamed Mahmoud, Mostafa Farouk Senussi, Abdelrahman Abdallah, et al.
TLDR
HIVE is an LLM framework that injects explicit visual-text reasoning into multimodal retrieval, achieving SOTA on reasoning-intensive queries.
Key contributions
- Introduces HIVE, a plug-and-play LLM framework for hypothesis-driven iterative visual evidence retrieval.
- Operates in four stages: initial retrieval, LLM query synthesis, secondary retrieval, and LLM verification.
- Achieves a new SOTA of 41.7 nDCG@10 on MM-BRIGHT, a +14.1-point gain over the best multimodal model (Nomic-Vision: 27.6); see the metric sketch after this list.
- Demonstrates strong performance in visually demanding domains like Gaming (68.2) and Chemistry (42.5).
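For reference, nDCG@10 (the metric cited throughout) scores a ranked list by discounting relevant documents that appear lower in the top 10, normalized by the best possible ordering. A minimal sketch in Python; the relevance labels in the example are made up for illustration:

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(rank + 2)  # rank is 0-indexed
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """nDCG@k: DCG of the ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: binary relevance labels for a ranked list of 10 retrieved docs.
print(round(ndcg_at_k([1, 0, 1, 1, 0, 0, 1, 0, 0, 0]), 2))  # ≈ 0.88
```

Benchmark scores are reported ×100, so the 41.7 aggregate corresponds to an average per-query nDCG of about 0.417.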
Why it matters
Multimodal retrieval models often fail on complex queries requiring deep integration of visual and textual reasoning. HIVE addresses this by leveraging LLMs to explicitly articulate and verify visual hypotheses, substantially closing the multimodal reasoning gap. This framework significantly improves retrieval performance in technical domains where visual evidence is critical.
Original Abstract
Multimodal retrieval models fail on reasoning-intensive queries where images (diagrams, charts, screenshots) must be deeply integrated with text to identify relevant documents: the best multimodal model achieves only 27.6 nDCG@10 on MM-BRIGHT, underperforming even strong text-only retrievers (32.2). We introduce **HIVE** (**H**ypothesis-driven **I**terative **V**isual **E**vidence Retrieval), a plug-and-play framework that injects explicit visual-text reasoning into a retriever via LLMs. HIVE operates in four stages: (1) initial retrieval over the corpus, (2) LLM-based compensatory query synthesis that explicitly articulates visual and logical gaps observed in top-k candidates, (3) secondary retrieval with the refined query, and (4) LLM verification and reranking over the union of candidates. Evaluated on the multimodal-to-text track of MM-BRIGHT (2,803 real-world queries across 29 technical domains), HIVE achieves a new state-of-the-art aggregated nDCG@10 of **41.7**, a **+9.5**-point gain over the best text-only model (DiVeR: 32.2) and **+14.1** over the best multimodal model (Nomic-Vision: 27.6); our reasoning-enhanced base retriever contributes 33.2 and the HIVE framework adds a further **+8.5** points, with particularly strong results in visually demanding domains (Gaming: 68.2, Chemistry: 42.5, Sustainability: 49.4). Compatible with both standard and reasoning-enhanced retrievers, HIVE demonstrates that LLM-mediated visual hypothesis generation and verification can substantially close the multimodal reasoning gap in retrieval. https://github.com/mm-bright/multimodal-reasoning-retrieval
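To make the four-stage loop concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `retriever` and `llm` objects and the method names `search`, `synthesize_query`, and `verify_and_rerank` are hypothetical placeholders, not the authors' implementation or API.

```python
# Hypothetical sketch of the four-stage HIVE loop described in the abstract.
def hive(query_text, query_image, corpus, retriever, llm, k=10):
    # Stage 1: initial retrieval over the corpus with the raw query.
    first_pass = retriever.search(query_text, query_image, corpus, top_k=k)

    # Stage 2: the LLM inspects the top-k candidates and synthesizes a
    # compensatory query articulating the visual and logical gaps it
    # observes in them.
    refined_query = llm.synthesize_query(query_text, query_image,
                                         candidates=first_pass)

    # Stage 3: secondary retrieval with the refined query.
    second_pass = retriever.search(refined_query, query_image, corpus, top_k=k)

    # Stage 4: the LLM verifies and reranks the union of both candidate sets
    # (deduplicated by document id, a placeholder attribute).
    pool = list({doc.id: doc for doc in first_pass + second_pass}.values())
    return llm.verify_and_rerank(query_text, query_image, pool)[:k]
```

Because the framework only wraps retrieval calls and LLM calls, it is plug-and-play in the sense the paper claims: any retriever exposing a search interface, standard or reasoning-enhanced, can slot in.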