ArXiv TLDR

Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation

arXiv: 2604.10741

Fangda Ye, Zhifei Xie, Yuxin Hu, Yihang Yin, Shurui Huang + 3 more

cs.CL · cs.AI · cs.IR

TLDR

Deep-Reporter is an agentic framework for grounded multimodal long-form generation, integrating text and visuals to create expert-like reports.

Key contributions

  • Agentic Multimodal Search & Filtering: Retrieves and filters both text and information-dense visuals.
  • Checklist-Guided Incremental Synthesis: Ensures coherent image-text integration and optimal citation placement.
  • Recurrent Context Management: Balances long-range coherence with local fluency in generated content.
  • M2LongBench Testbed: Introduces a comprehensive benchmark with 247 tasks across 9 domains.
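The three components above form a generate-while-retrieving loop. A minimal, hypothetical sketch of such a loop is below; all names (`search_and_filter`, `synthesize_section`, `generate_report`) and the simple keyword-overlap retrieval are illustrative assumptions, not the paper's actual API or method.

```python
def search_and_filter(query, corpus):
    """Agentic Multimodal Search & Filtering (sketch): keep text or visual
    evidence whose description shares at least one word with the query."""
    words = set(query.lower().split())
    return [item for item in corpus if words & set(item["text"].lower().split())]

def synthesize_section(checklist_item, evidence):
    """Checklist-Guided Incremental Synthesis (sketch): draft one section
    and attach citation markers for every piece of retrieved evidence."""
    cites = ", ".join(f"[{e['id']}]" for e in evidence)
    return f"{checklist_item}: grounded claim {cites}"

def generate_report(checklist, corpus, context_window=2):
    """Recurrent Context Management (sketch): only the most recent sections
    are carried forward as context, bounding memory while preserving
    local coherence."""
    sections, context = [], []
    for item in checklist:
        evidence = search_and_filter(item, corpus)
        sections.append(synthesize_section(item, evidence))
        context = sections[-context_window:]  # bounded recurrent context
    return "\n".join(sections)

# Example: a tiny mixed corpus of a text passage (T1) and a chart (V1).
corpus = [
    {"id": "T1", "text": "market trends analysis"},
    {"id": "V1", "text": "chart of market share"},
]
report = generate_report(["market overview"], corpus)
print(report)
```

The sliding `context_window` stands in for the paper's recurrent context management: each new section sees only a bounded suffix of the draft rather than the full report.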

Why it matters

This paper addresses a critical gap: existing agentic search frameworks are text-centric, while real-world expert reports depend on multimodal evidence. By pairing a novel generation framework with a comprehensive benchmark, it highlights both the challenges and the potential of integrating diverse information sources into long-form output.

Original Abstract

Recent agentic search frameworks enable deep research via iterative planning and retrieval, reducing hallucinations and enhancing factual grounding. However, they remain text-centric, overlooking the multimodal evidence that characterizes real-world expert reports. We introduce a pressing task: multimodal long-form generation. Accordingly, we propose Deep-Reporter, a unified agentic framework for grounded multimodal long-form generation. It orchestrates: (i) Agentic Multimodal Search and Filtering to retrieve and filter textual passages and information-dense visuals; (ii) Checklist-Guided Incremental Synthesis to ensure coherent image-text integration and optimal citation placement; and (iii) Recurrent Context Management to balance long-range coherence with local fluency. We develop a rigorous curation pipeline producing 8K high-quality agentic traces for model optimization. We further introduce M2LongBench, a comprehensive testbed comprising 247 research tasks across 9 domains and a stable multimodal sandbox. Extensive experiments demonstrate that long-form multimodal generation is a challenging task, especially in multimodal selection and integration, and effective post-training can bridge the gap.
