POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch

April 15, 2026

arXiv:2604.14029

Yikun Liu, Yuan Liu, Le Tian, Xiao Zhou, Jiangchao Yao + 2 more

cs.CV

TLDR

POINTS-Seeker trains a multimodal agentic search model from scratch, rather than retrofitting a general LMM with search tools, and improves performance on long-horizon interactions.

Key contributions

Introduces Agentic Seeding, a dedicated phase to foster foundational agentic behaviors.
Proposes V-Fold, an adaptive history compression scheme for long-horizon interactions.
Develops POINTS-Seeker-8B, a state-of-the-art multimodal agentic search model that outperforms existing models across six benchmarks.

Why it matters

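This paper explores training a multimodal agentic search model from scratch instead of retrofitting general LMMs with search tools as modular extensions. It introduces Agentic Seeding to build the precursors of agentic behavior, and V-Fold to address a long-horizon bottleneck in which growing interaction history makes it harder for the model to locate ground-truth evidence. The resulting POINTS-Seeker-8B is reported as state of the art on six benchmarks for knowledge-intensive visual reasoning.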

Original Abstract

While Large Multimodal Models (LMMs) demonstrate impressive visual perception, they remain epistemically constrained by their static parametric knowledge. To transcend these boundaries, multimodal search models have been adopted to actively interact with the external environment for evidence retrieval. Diverging from prevailing paradigms that merely retrofit general LMMs with search tools as modular extensions, we explore the potential of building a multimodal agentic search model from scratch. Specifically, we make the following contributions: (i) we introduce Agentic Seeding, a dedicated phase designed to weave the foundational precursors necessary for eliciting agentic behaviors; (ii) we uncover a performance bottleneck in long-horizon interactions, where the increasing volume of interaction history overwhelms the model's ability to locate ground-truth evidence. To mitigate this, we propose V-Fold, an adaptive history-aware compression scheme that preserves recent dialogue turns in high fidelity while folding historical context into the visual space via rendering; and (iii) we develop POINTS-Seeker-8B, a state-of-the-art multimodal agentic search model that consistently outperforms existing models across six diverse benchmarks, effectively resolving the challenges of long-horizon, knowledge-intensive visual reasoning.
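The abstract describes V-Fold only at a high level: recent turns stay in high fidelity, older context is folded into the visual space via rendering. The sketch below is not the authors' implementation. It illustrates just the bookkeeping such a scheme might need, under stated assumptions: the `Turn` type, the `fold_history` function, the character-budget trigger, and all parameter names are hypothetical, and the actual rasterization of each page into an image is left out (a page here is simply a list of wrapped text lines that a renderer would draw).

```python
import textwrap
from dataclasses import dataclass


@dataclass
class Turn:
    role: str  # e.g. "user", "assistant", "tool" (hypothetical labels)
    text: str


def fold_history(turns, keep_recent=2, char_budget=200,
                 line_width=40, lines_per_page=8):
    """Split a history into (pages, recent).

    pages:  older turns, wrapped into fixed-size pages of text lines that a
            renderer could draw as images (rendering itself is omitted).
    recent: the last `keep_recent` turns, kept verbatim as text.

    Assumption: folding is triggered only when the total text length exceeds
    `char_budget`; the paper's actual adaptivity criterion is not given in
    the abstract.
    """
    total_chars = sum(len(t.text) for t in turns)
    if total_chars <= char_budget or len(turns) <= keep_recent:
        return [], list(turns)

    # Index-based split so keep_recent=0 folds everything.
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]

    # Wrap each older turn to a fixed width, tagging it with its role.
    lines = []
    for t in older:
        lines.extend(textwrap.wrap(f"[{t.role}] {t.text}", width=line_width))

    # Group wrapped lines into pages, one page per would-be image.
    pages = [lines[i:i + lines_per_page]
             for i in range(0, len(lines), lines_per_page)]
    return pages, recent
```

Calling `fold_history(history)` on a short history returns no pages and the full history unchanged; once the budget is exceeded, everything except the last `keep_recent` turns moves into pages. In a real system each page would presumably be rendered to an image and passed to the model alongside the verbatim recent turns.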
