InterLV-Search: Benchmarking Interleaved Multimodal Agentic Search

2605.07510

Bohan Hou, Jiuning Gu, Jiayan Guo, Ronghao Dang, Sicong Leng + 3 more

cs.CV cs.CL cs.IR

TLDR

InterLV-Search is a new benchmark for interleaved language-vision agentic search, revealing that current multimodal agents struggle to integrate complex visual evidence into their search trajectories.

Key contributions

  • Introduces InterLV-Search, a benchmark for interleaved language-vision agentic search.
  • Features 2,061 examples across three levels: active visual evidence seeking, controlled offline interleaved search, and open-web interleaved search.
  • Includes unique multi-branch samples for comparing multiple entities during evidence search.
  • Provides InterLV-Agent for standardized tool use, trajectory logging, and evaluation (a hypothetical sketch of such a loop follows this list).
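
The paper describes InterLV-Agent only at the level of its responsibilities (standardized tool use, trajectory logging, and evaluation), so here is a minimal sketch, assuming a plain tool-calling loop, of how an interleaved search episode could be wired up. Every name in it (Trajectory, run_episode, agent.decide, the tools registry) is a hypothetical illustration, not the released API; the actual interface lives in the linked repository.

```python
# Hypothetical sketch of an interleaved language-vision search loop with
# trajectory logging. None of these names come from the InterLV-Agent release.
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Trajectory:
    """Records every tool call and its observation for later evaluation."""
    steps: list[dict[str, Any]] = field(default_factory=list)

    def log(self, tool: str, args: dict[str, Any], observation: Any) -> None:
        self.steps.append({"tool": tool, "args": args, "observation": observation})


def run_episode(agent: Any,
                tools: dict[str, Callable[..., Any]],
                question: str,
                max_steps: int = 10) -> tuple[str, Trajectory]:
    """Interleaved loop: each textual or visual observation conditions the next search."""
    traj = Trajectory()
    context: list[dict[str, Any]] = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        # Hypothetical agent interface: returns either a tool call
        # (e.g. text search, image search, visual browsing) or a final answer.
        action = agent.decide(context)
        if action["tool"] == "answer":
            return action["args"]["text"], traj
        observation = tools[action["tool"]](**action["args"])
        traj.log(action["tool"], action["args"], observation)
        # Feed the new evidence back so it conditions the next search step.
        context.append({"role": "tool", "content": observation})
    return "", traj  # step budget exhausted without an answer
```

With a loop of this shape, evaluation reduces to comparing the returned answer against the gold label, while the logged trajectory can be inspected for failures in visual evidence seeking or search control.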

Why it matters

Current multimodal search benchmarks fail to capture the dynamic, interleaved nature of real-world search, in which visual and textual evidence continuously informs subsequent steps. InterLV-Search addresses this gap with a more realistic and challenging evaluation and highlights critical areas for improvement in multimodal agent capabilities: visual evidence seeking, search control, and evidence integration.

Original Abstract

Existing benchmarks for multimodal agentic search evaluate multimodal search and visual browsing, but visual evidence is either confined to the input or treated as an answer endpoint rather than part of an interleaved search trajectory. We introduce InterLV-Search, a benchmark for Interleaved Language-Vision Agentic Search, in which textual and visual evidence is repeatedly used to condition later search. It contains 2,061 examples across three levels: active visual evidence seeking, controlled offline interleaved multimodal search, and open-web interleaved multimodal search. Beyond existing benchmarks, it also includes multimodal multi-branch samples that involve comparison between multiple entities during the evidence search. We construct Level 1 and Level 2 with automated pipelines and Level 3 with a machine-led, human-supervised open-web pipeline. We further provide InterLV-Agent for standardized tool use, trajectory logging, and evaluation. Experiments on proprietary and open-source multimodal agents show that current systems remain far from solving interleaved multimodal search, with the best model below 50% overall accuracy, highlighting challenges in visual evidence seeking, search control, and multimodal evidence integration. We release the benchmark data and evaluation code at https://github.com/hbhalpha/InterLV-Search-Bench
