Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios

April 15, 2026 · arXiv:2604.14041

Xiaomin Li, Tala Wang, Zichen Zhong, Ying Zhang, Zirui Zheng + 5 more

cs.CV

TLDR

DailyClue is a new benchmark for MLLMs that evaluates their ability to perform visual clue-driven reasoning in complex, real-world daily scenarios.

Key contributions

Introduces DailyClue, a benchmark for visual clue-driven reasoning in authentic daily scenarios.
Designs challenging queries that require models to actively search for visual clues and reason from them, not just recognize objects.
Comprises a comprehensive dataset spanning four major daily domains and 16 distinct subtasks.
Reveals that accurate visual clue identification is a critical challenge for current MLLMs and agentic models.

Why it matters

Existing MLLM benchmarks mostly test prior knowledge or perception. This paper targets the missing piece: reasoning driven by visual clues, which real-world daily use depends on. Its evaluation shows that current MLLMs and agentic models struggle to identify the decisive visual clues, which makes clue identification a concrete target for future work on robust multimodal reasoning.

Original Abstract

Daily scenarios are characterized by visual richness, requiring Multimodal Large Language Models (MLLMs) to filter noise and identify decisive visual clues for accurate reasoning. Yet, current benchmarks predominantly aim at evaluating MLLMs' pre-existing knowledge or perceptual understanding, often neglecting the critical capability of reasoning. To bridge this gap, we introduce DailyClue, a benchmark designed for visual clue-driven reasoning in daily scenarios. Our construction is guided by two core principles: (1) strict grounding in authentic daily activities, and (2) challenging query design that necessitates more than surface-level perception. Instead of simple recognition, our questions compel MLLMs to actively explore suitable visual clues and leverage them for subsequent reasoning. To this end, we curate a comprehensive dataset spanning four major daily domains and 16 distinct subtasks. Comprehensive evaluation across MLLMs and agentic models underscores the formidable challenge posed by our benchmark. Our analysis reveals several critical insights, emphasizing that the accurate identification of visual clues is essential for robust reasoning.
