Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios
Xiaomin Li, Tala Wang, Zichen Zhong, Ying Zhang, Zirui Zheng + 5 more
TLDR
DailyClue is a new benchmark for MLLMs that evaluates their ability to perform visual clue-driven reasoning in complex, real-world daily scenarios.
Key contributions
- Introduces DailyClue, a benchmark for visual clue-driven reasoning in authentic daily scenarios.
- Designs challenging queries that require models to actively explore and leverage visual clues rather than rely on simple recognition.
- Curates a dataset spanning four major daily domains and 16 distinct subtasks.
- Evaluates MLLMs and agentic models, finding the benchmark highly challenging and accurate identification of visual clues essential for robust reasoning.
Why it matters
Current benchmarks mostly evaluate MLLMs' pre-existing knowledge or perceptual understanding, leaving visual clue-driven reasoning, which real-world daily use depends on, largely untested. DailyClue targets that gap, and its evaluation indicates that identifying and using the decisive visual clues remains difficult for current MLLMs and agentic models, pointing to a concrete direction for more robust multimodal reasoning.
Original Abstract
Daily scenarios are characterized by visual richness, requiring Multimodal Large Language Models (MLLMs) to filter noise and identify decisive visual clues for accurate reasoning. Yet, current benchmarks predominantly aim at evaluating MLLMs' pre-existing knowledge or perceptual understanding, often neglecting the critical capability of reasoning. To bridge this gap, we introduce DailyClue, a benchmark designed for visual clue-driven reasoning in daily scenarios. Our construction is guided by two core principles: (1) strict grounding in authentic daily activities, and (2) challenging query design that necessitates more than surface-level perception. Instead of simple recognition, our questions compel MLLMs to actively explore suitable visual clues and leverage them for subsequent reasoning. To this end, we curate a comprehensive dataset spanning four major daily domains and 16 distinct subtasks. Comprehensive evaluation across MLLMs and agentic models underscores the formidable challenge posed by our benchmark. Our analysis reveals several critical insights, emphasizing that the accurate identification of visual clues is essential for robust reasoning.