ArXiv TLDR

Towards Long-horizon Agentic Multimodal Search

arXiv: 2604.12890

Yifan Du, Zikang Liu, Jinbiao Peng, Jie Wu, Junyi Li + 3 more

cs.CV, cs.AI

TLDR

LMM-Searcher enables long-horizon multimodal search by offloading visual assets to external files referenced by lightweight textual UIDs and re-loading them on demand, reaching state-of-the-art performance among open-source models.

Key contributions

  • Uses a file-based visual representation that offloads visual assets to an external file system and maps them to lightweight textual UIDs, reducing context overhead.
  • Equips the agent with an on-demand fetch-image tool for progressive visual loading and active perception (see the sketch after this list).
  • Develops a data synthesis pipeline to generate queries for complex cross-modal multi-hop reasoning.
  • Achieves state-of-the-art performance on long-horizon benchmarks, scaling to 100-turn searches.
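
The following is a minimal sketch of how the file-based visual representation and the fetch-image tool could fit together. The storage location, UID format, and helper names (store_image, fetch_image) are illustrative assumptions, not the paper's actual API.

```python
import uuid
from pathlib import Path

# Hypothetical local store for offloaded visual assets (illustrative only).
ASSET_DIR = Path("visual_assets")
ASSET_DIR.mkdir(exist_ok=True)


def store_image(image_bytes: bytes) -> str:
    """Offload an image to the file system and return a lightweight textual
    UID that stands in for the image inside the agent's context."""
    uid = f"IMG-{uuid.uuid4().hex[:8]}"
    (ASSET_DIR / f"{uid}.png").write_bytes(image_bytes)
    return uid  # only this short string, not the pixels, enters the context


def fetch_image(uid: str) -> bytes:
    """On-demand fetch-image tool: load an offloaded asset back only when
    the agent decides it needs to look at it (active perception)."""
    path = ASSET_DIR / f"{uid}.png"
    if not path.exists():
        raise FileNotFoundError(f"no offloaded asset for UID {uid}")
    return path.read_bytes()
```

In this reading, every image encountered during search is offloaded immediately and only its UID travels through the conversation; the pixels re-enter the context only if the agent explicitly fetches them.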

Why it matters

Existing multimodal search agents struggle with context explosion and the loss of crucial visual signals over long horizons. By managing multimodal information efficiently, LMM-Searcher lets agents tackle complex, real-world tasks that require extended interaction and visual understanding, pushing the boundaries of agentic AI.

Original Abstract

Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long-horizon MultiModal deep search framework, named LMM-Searcher, centered on a file-based visual representation mechanism. By offloading visual assets to an external file system and mapping them to lightweight textual identifiers (UIDs), our approach mitigates context overhead while preserving multimodal information for future access. We equip the agent with a tailored fetch-image tool, enabling a progressive, on-demand visual loading strategy for active perception. Furthermore, we introduce a data synthesis pipeline designed to generate queries requiring complex cross-modal multi-hop reasoning. Using this pipeline, we distill 12K high-quality trajectories to fine-tune Qwen3-VL-Thinking-30A3B into a specialized multimodal deep search agent. Extensive experiments across four benchmarks demonstrate that our method successfully scales to 100-turn search horizons, achieving state-of-the-art performance among open-source models on challenging long-horizon benchmarks like MM-BrowseComp and MMSearch-Plus, while also exhibiting strong generalizability across different base models. Our code will be released in https://github.com/RUCAIBox/LMM-Searcher.
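
To show how the pieces could combine over many turns, here is a rough, runnable skeleton of a long-horizon search loop that builds on the store_image / fetch_image helpers sketched above. The model_step and web_search stubs, the action schema, and the message format are assumptions for illustration, not the paper's agent interface or the Qwen3-VL tool-calling API.

```python
from typing import Any

# Assumes store_image / fetch_image from the earlier sketch are in scope.


def model_step(context: list[dict[str, Any]]) -> dict[str, Any]:
    """Stub for one LMM decoding step (a real agent would call the
    fine-tuned model here). Returns a tool call or a final answer."""
    last = str(context[-1].get("content", ""))
    if "[image: IMG-" in last:
        uid = last.split("[image: ")[1].rstrip("]")
        return {"type": "fetch_image", "uid": uid}
    if context[-1].get("role") == "tool":
        return {"type": "answer", "content": "stub answer"}
    return {"type": "search", "query": "example query"}


def web_search(query: str) -> tuple[str, bytes]:
    """Hypothetical search helper returning page text plus a screenshot."""
    return f"results for {query!r}", b"stub screenshot bytes"


def run_search(query: str, max_turns: int = 100) -> str:
    context: list[dict[str, Any]] = [{"role": "user", "content": query}]
    for _ in range(max_turns):  # the paper reports scaling to 100-turn horizons
        action = model_step(context)
        if action["type"] == "answer":
            return action["content"]
        if action["type"] == "search":
            # Offload the screenshot right away; only its UID enters the context.
            text, image_bytes = web_search(action["query"])
            uid = store_image(image_bytes)
            context.append({"role": "tool", "content": f"{text} [image: {uid}]"})
        elif action["type"] == "fetch_image":
            # Progressive, on-demand visual loading for active perception.
            pixels = fetch_image(action["uid"])
            context.append({"role": "tool",
                            "content": f"<loaded {len(pixels)} image bytes>"})
    return "no answer within the turn budget"
```

The point of the loop is that the context stays lightweight across dozens of turns: visual evidence accumulates as UIDs on disk, and only the images the agent actively requests are ever loaded back.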
