CArtBench: Evaluating Vision-Language Models on Chinese Art Understanding, Interpretation, and Authenticity
Xuefeng Wei, Zhixuan Wang, Xuan Zhou, Zhi Qu, Hongyao Li, et al.
TLDR
CArtBench is a new benchmark evaluating Vision-Language Models on complex Chinese art understanding, interpretation, and authenticity tasks.
Key contributions
- Introduces CArtBench, a museum-grounded benchmark for evaluating VLMs on Chinese art understanding.
- Features subtasks like CURATORQA for evidence-grounded reasoning and CATALOGCAPTION for expert-style appreciation.
- Includes REINTERPRET for defensible reinterpretation and CONNOISSEURPAIRS for diagnostic authenticity discrimination.
Why it matters
This paper addresses a gap in evaluating VLMs on nuanced art understanding. CArtBench tests models beyond simple recognition, revealing that current VLMs fall short of expert-level reasoning on Chinese art — particularly on evidence linking, long-form appreciation, and authenticity judgment — and sets a higher bar for future VLM development.
Original Abstract
We introduce CARTBENCH, a museum-grounded benchmark for evaluating vision-language models (VLMs) on Chinese artworks beyond short-form recognition and QA. CARTBENCH comprises four subtasks: CURATORQA for evidence-grounded recognition and reasoning, CATALOGCAPTION for structured four-section expert-style appreciation, REINTERPRET for defensible reinterpretation with expert ratings, and CONNOISSEURPAIRS for diagnostic authenticity discrimination under visually similar confounds. CARTBENCH is built by aligning image-bearing Palace Museum objects from Wikidata with authoritative catalog pages, spanning five art categories across multiple dynasties. Across nine representative VLMs, we find that high overall CURATORQA accuracy can mask sharp drops on hard evidence linking and style-to-period inference; long-form appreciation remains far from expert references; and authenticity-oriented diagnostic discrimination stays near chance, underscoring the difficulty of connoisseur-level reasoning for current models.