UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning
TLDR
UnAC is a multimodal prompting method that enhances LMMs' complex reasoning by using adaptive visual prompting, image abstraction, and gradual self-checking.
Key contributions
- Introduces UnAC, a multimodal prompting method for LMMs to tackle complex reasoning tasks.
- Proposes adaptive visual prompting to help LMMs focus on salient image regions for better understanding.
- Designs an image-abstraction prompt to effectively extract key information from visual evidence.
- Implements a gradual self-checking scheme to verify subquestions and answers, improving reasoning.
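The three components above could be combined into a single prompting loop. The sketch below is an illustrative assumption, not the authors' implementation: the prompt wording, the `call_lmm` stub, and the control flow are all hypothetical stand-ins for calls to an LMM API such as GPT-4o.

```python
def call_lmm(prompt: str, image=None) -> str:
    """Stub standing in for a real LMM API call (hypothetical)."""
    return f"answer to: {prompt[:40]}"

def unac(question: str, image) -> str:
    """Hedged sketch of a UnAC-style pipeline: understand, abstract, check."""
    # 1) Adaptive visual prompting: steer the model toward salient regions.
    regions = call_lmm(f"List the image regions relevant to: {question}", image)

    # 2) Image abstraction: extract key facts from those regions.
    facts = call_lmm(f"Summarize the key facts in regions: {regions}", image)

    # 3) Gradual self-checking: answer and verify each subquestion in turn.
    subquestions = call_lmm(f"Decompose into subquestions: {question}").split(";")
    verified = []
    for sq in subquestions:
        ans = call_lmm(f"Given facts [{facts}], answer: {sq}", image)
        check = call_lmm(f"Does '{ans}' answer '{sq}'? Reply OK or REVISE.")
        if "REVISE" in check:
            ans = call_lmm(f"Revise the answer to: {sq}", image)
        verified.append(ans)

    # 4) Final answer conditioned on the verified intermediate results.
    return call_lmm(f"Using verified steps {verified}, answer: {question}", image)
```

The design point is that verification happens per subquestion rather than once at the end, so an error caught early does not propagate into later steps.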
Why it matters
LMMs often struggle with complex, multi-step visual reasoning despite strong perception. UnAC addresses this by providing a robust prompting method that significantly enhances their ability to understand, abstract, and verify visual information. This leads to more reliable performance on challenging multimodal tasks.
Original Abstract
Although recent LMMs have become much stronger at visual perception, they remain unreliable on problems that require multi-step reasoning over visual evidence. In this paper, we present UnAC (Understanding, Abstracting, and Checking), a multimodal prompting method that strengthens reasoning for complex multimodal tasks in LMMs (e.g., GPT-4o, Gemini 1.5, and GPT-4V). To improve image understanding and capture fine details, we propose an adaptive visual prompting strategy that enables LMMs to focus on salient regions. We further design an image-abstraction prompt to effectively extract key information from images. In addition, we introduce a gradual self-checking scheme that improves reasoning by verifying each decomposed subquestion and its answer. Extensive experiments are conducted on three public benchmarks: MathVista, MM-Vet, and MMMU.