ArXiv TLDR

A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

2604.19689

Shuai Wang, Hongyi Zhu, Jia-Hong Huang, Yixian Shen, Chengxi Zeng + 4 more

cs.AI

TLDR

A-MAR is an agent-based framework that uses structured reasoning plans for interpretable, evidence-grounded multimodal art retrieval and understanding.

Key contributions

  • Proposes A-MAR, an agent-based framework for multimodal art retrieval using structured reasoning plans.
  • Conditions retrieval on explicit reasoning plans for targeted evidence selection and step-wise explanations.
  • Introduces ArtCoT-QA, a diagnostic benchmark for multi-step reasoning in the art domain.
  • Outperforms MLLM baselines and static retrieval in explanation quality and evidence grounding.

Why it matters

This paper addresses the interpretability and evidence grounding limitations of MLLMs in art understanding. By explicitly conditioning retrieval on reasoning plans, A-MAR offers a more transparent and verifiable approach. Its success paves the way for goal-driven AI systems in cultural heritage.

Original Abstract

Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non-planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A-MAR.
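The core idea of the abstract — decompose a query into a plan of goals with evidence requirements, then condition each retrieval call on its plan step — can be sketched roughly as below. This is a minimal illustration, not A-MAR's actual implementation; all names (`PlanStep`, `make_plan`, `retrieve`, `answer`) and the toy word-overlap retriever are hypothetical assumptions for exposition.

```python
from dataclasses import dataclass

# Hypothetical sketch of reasoning-plan-conditioned retrieval.
# In A-MAR, plan generation and retrieval are agent- and model-driven;
# here both are replaced with trivial stand-ins.

@dataclass
class PlanStep:
    goal: str            # what this reasoning step should establish
    evidence_query: str  # the query used to retrieve supporting evidence

def make_plan(user_query: str) -> list[PlanStep]:
    # Fixed illustrative decomposition for an art-understanding question;
    # the real system produces this plan with an agent.
    return [
        PlanStep("identify visual style cues",
                 f"visual style features {user_query}"),
        PlanStep("ground in art-historical context",
                 f"historical context {user_query}"),
    ]

def retrieve(evidence_query: str, corpus: dict[str, str]) -> str:
    # Toy retriever: return the document sharing the most words with the query.
    q = set(evidence_query.lower().split())
    return max(corpus, key=lambda doc: len(q & set(corpus[doc].lower().split())))

def answer(user_query: str, corpus: dict[str, str]) -> list[tuple[str, str]]:
    # Each step's explanation is grounded in the evidence retrieved for that
    # step, yielding a step-wise, inspectable chain instead of one opaque call.
    return [(step.goal, retrieve(step.evidence_query, corpus))
            for step in make_plan(user_query)]
```

The point of the sketch is the structure: because retrieval happens per plan step, every piece of the final explanation can cite the evidence its step selected, which is what the paper argues static, non-planned retrieval lacks.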
