OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents
Shuang Chen, Kaituo Feng, Hangting Chen, Wenxuan Huang, Dasen Dai + 5 more
TLDR
OpenSearch-VL provides a fully open-source recipe for training frontier multimodal deep search agents, delivering over 10-point average gains across seven benchmarks.
Key contributions
- Curated high-quality training data (SearchVL-SFT-36k, SearchVL-RL-8k) reducing retrieval shortcuts.
- Developed a diverse tool environment integrating text/image search, OCR, and image processing.
- Introduced a fatal-aware GRPO algorithm to manage cascading tool failures in multi-turn interactions.
- OpenSearch-VL yields over 10-point average improvements across seven benchmarks, with results comparable to proprietary commercial models on several tasks.
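The fatal-aware GRPO idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the per-turn advantage layout, and the exact clamping rule (clamping pre-failure advantages at zero from below so earlier reasoning is not penalized for a later fatal tool failure) are assumptions consistent with the description above.

```python
import numpy as np

def fatal_aware_advantages(rewards, failure_turns, num_turns):
    """Sketch of fatal-aware advantage shaping for a GRPO group.

    rewards: length-G scalar reward per trajectory in the group
    failure_turns: length-G list; index of the first fatal tool failure
        in each trajectory, or None if no fatal failure occurred
    num_turns: turns per trajectory (assumed equal here for simplicity)
    Returns a (G, num_turns) array of per-turn advantages.
    """
    r = np.asarray(rewards, dtype=float)
    # Standard GRPO group-normalized advantage (mean/std over the group).
    adv = (r - r.mean()) / (r.std() + 1e-8)

    A = np.repeat(adv[:, None], num_turns, axis=1)  # broadcast per turn
    mask = np.ones_like(A)                          # 1 = train on this turn
    for i, t in enumerate(failure_turns):
        if t is not None:
            # Mask all post-failure turns: their tokens carry no gradient.
            mask[i, t:] = 0.0
            # One-sided clamp: keep pre-failure reasoning from being
            # punished by the negative group advantage of a failed run.
            A[i, :t] = np.maximum(A[i, :t], 0.0)
    return A * mask
```

In this sketch, a trajectory that hits a cascading tool failure contributes no gradient after the failure point, while whatever it did before the failure is at worst ignored rather than penalized.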
Why it matters
Reproducibility of advanced multimodal search agents is challenging due to lack of open data and recipes. OpenSearch-VL addresses this by providing a fully open-source framework, datasets, and algorithms. This enables broader research and development in multimodal deep search, pushing the frontier of agent capabilities.
Original Abstract
Deep search has become a crucial capability for frontier multimodal agents, enabling models to solve complex questions through active search, evidence verification, and multi-step reasoning. Despite rapid progress, top-tier multimodal search agents remain difficult to reproduce, largely due to the absence of open high-quality training data, transparent trajectory synthesis pipelines, or detailed training recipes. To this end, we introduce OpenSearch-VL, a fully open-source recipe for training frontier multimodal deep search agents with agentic reinforcement learning. First, we curated a dedicated pipeline to construct high-quality training data through Wikipedia path sampling, fuzzy entity rewriting, and source-anchor visual grounding, which jointly reduce shortcuts and one-step retrieval collapse. Based on this pipeline, we curate two training datasets, SearchVL-SFT-36k for SFT and SearchVL-RL-8k for RL. Besides, we design a diverse tool environment that unifies text search, image search, OCR, cropping, sharpening, super-resolution, and perspective correction, enabling agents to combine active perception with external knowledge acquisition. Finally, we propose a multi-turn fatal-aware GRPO training algorithm that handles cascading tool failures by masking post-failure tokens while preserving useful pre-failure reasoning through one-sided advantage clamping. Built on this recipe, OpenSearch-VL delivers substantial performance gains, with over 10-point average improvements across seven benchmarks, and achieves results comparable to proprietary commercial models on several tasks. We will release all data, code, and models to support open research on multimodal deep search agents.
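The unified tool environment described above can be illustrated with a small dispatch sketch. All names, signatures, and stub behaviors here are hypothetical; the point is the pattern of registering heterogeneous tools (search, OCR, image operations) behind one calling interface that returns errors as observations rather than crashing the rollout.

```python
from typing import Callable, Dict

# Hypothetical tool registry; real tools would wrap search APIs,
# an OCR engine, and image-processing routines.
TOOLS: Dict[str, Callable[..., str]] = {}

def register(name: str):
    """Decorator that adds a tool function to the registry."""
    def deco(fn):
        TOOLS[name] = fn
        return fn
    return deco

@register("text_search")
def text_search(query: str) -> str:
    return f"[stub] search results for: {query}"

@register("crop")
def crop(image_path: str, box: tuple) -> str:
    return f"[stub] cropped {image_path} to region {box}"

def call_tool(name: str, **kwargs) -> str:
    """Dispatch an agent tool call; failures become observations,
    which is what lets a training algorithm detect and handle them."""
    if name not in TOOLS:
        return f"TOOL_ERROR: unknown tool '{name}'"
    try:
        return TOOLS[name](**kwargs)
    except Exception as e:
        return f"TOOL_ERROR: {e}"
```

Returning `TOOL_ERROR` strings instead of raising keeps the multi-turn loop alive and gives the trainer an explicit signal for marking failure turns.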