Appear2Meaning: A Cross-Cultural Benchmark for Structured Cultural Metadata Inference from Images
Yuechen Jiang, Enze Zhang, Md Mohsinul Kabir, Qianqian Xie, Stavroula Golfomitsou, et al.
TLDR
Introduces Appear2Meaning, a cross-cultural benchmark for evaluating whether VLMs can infer structured cultural metadata from images; results show that current models make inconsistent, weakly grounded predictions.
Key contributions
- Introduces Appear2Meaning, a new cross-cultural benchmark for structured cultural metadata inference.
- Evaluates VLMs with an LLM-as-Judge framework that scores the semantic alignment between model predictions and reference cultural annotations (see the sketch after this list).
- Reveals that VLMs struggle to produce consistent, well-grounded predictions across cultures and metadata types.
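The paper does not publish its judging prompt or pipeline, but a minimal LLM-as-Judge setup for this kind of semantic-alignment scoring could look like the sketch below. The choice of judge model, the prompt wording, the 0-2 scale, and the `judge_alignment` helper are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an LLM-as-Judge check for semantic alignment between a
# model's predicted metadata value and the reference annotation.
# Assumptions (not from the paper): the judge model, prompt wording, and the
# 0-2 scoring scale are illustrative choices.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading cultural-metadata predictions.
Attribute: {attribute}
Reference annotation: {reference}
Model prediction: {prediction}

Score the semantic alignment:
2 = equivalent meaning (exact match),
1 = partially overlapping meaning (partial match),
0 = unrelated or wrong.
Answer with a single digit."""

def judge_alignment(attribute: str, reference: str, prediction: str) -> int:
    """Ask a judge LLM to score one attribute prediction on a 0-2 scale."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # the judge model here is an assumption
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                attribute=attribute, reference=reference, prediction=prediction
            ),
        }],
        temperature=0,
    )
    text = (response.choices[0].message.content or "").strip()
    return int(text[0]) if text and text[0] in "012" else 0

# Example: judging a hypothetical "period" prediction against its annotation.
print(judge_alignment("period", "Edo period (1603-1868)", "19th-century Japan"))
```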
Why it matters
This paper addresses a critical gap in VLM capabilities beyond basic image captioning. By quantifying current models' limitations in cultural reasoning, it paves the way for future research on more culturally aware AI systems, which are crucial for cultural heritage applications.
Original Abstract
Recent advances in vision-language models (VLMs) have improved image captioning for cultural heritage. However, inferring structured cultural metadata (e.g., creator, origin, period) from visual input remains underexplored. We introduce a multi-category, cross-cultural benchmark for this task and evaluate VLMs using an LLM-as-Judge framework that measures semantic alignment with reference annotations. To assess cultural reasoning, we report exact-match, partial-match, and attribute-level accuracy across cultural regions. Results show that models capture fragmented signals and exhibit substantial performance variation across cultures and metadata types, leading to inconsistent and weakly grounded predictions. These findings highlight the limitations of current VLMs in structured cultural metadata inference beyond visual perception.
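For intuition, here is one way the reported exact-match, partial-match, and attribute-level accuracies could be aggregated from per-prediction judge scores. The record layout and the 0/1/2 score convention are assumptions carried over from the sketch above, not the paper's actual pipeline.

```python
# Sketch: aggregating exact-match, partial-match, and attribute-level accuracy
# from per-prediction judge scores. The record layout and the 0/1/2 score
# convention are assumptions for illustration, not the paper's exact metrics.
from collections import defaultdict

# Each record: (cultural_region, attribute, judge_score), where
# 2 = exact match, 1 = partial match, 0 = no match.
records = [
    ("East Asia", "creator", 2),
    ("East Asia", "period", 1),
    ("Europe", "origin", 0),
    ("Europe", "period", 2),
]

def accuracy_by(records, key_index):
    """Exact- and partial-match rates grouped by region (0) or attribute (1)."""
    totals = defaultdict(lambda: {"n": 0, "exact": 0, "partial": 0})
    for record in records:
        bucket = totals[record[key_index]]
        bucket["n"] += 1
        bucket["exact"] += record[2] == 2
        bucket["partial"] += record[2] >= 1  # partial match includes exact
    return {
        key: {
            "exact_match": bucket["exact"] / bucket["n"],
            "partial_match": bucket["partial"] / bucket["n"],
        }
        for key, bucket in totals.items()
    }

print(accuracy_by(records, 0))  # per cultural region
print(accuracy_by(records, 1))  # per metadata attribute (attribute-level)
```

Grouping the same scores along both axes is what exposes the cross-cultural and cross-attribute performance variation the abstract describes.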