ArXiv TLDR

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

arXiv:2604.22498

Lihao Zheng, Zhenwei Shao, Yu Zhou, Yan Yang, Xintian Shen + 3 more

cs.CV · cs.AI

TLDR

CGC is a low-cost framework that boosts fine-grained multi-image understanding in MLLMs via compositional grounded-contrast training data and a rule-based spatial reward.

Key contributions

  • Low-cost framework boosting MLLM fine-grained multi-image understanding.
  • Constructs compositional multi-image training instances via Inter-Image and Intra-Image Contrast.
  • Introduces a Rule-Based Spatial Reward to improve source-image attribution, spatial alignment, and structured-output validity.
  • Achieves state-of-the-art results on fine-grained multi-image benchmarks (MIG-Bench, VLM2-Bench) and transfers gains to broader multimodal tasks.

Why it matters

MLLMs struggle with fine-grained multi-image understanding, exhibiting spatial hallucination, attention leakage, and failures in object constancy. CGC provides a low-cost remedy by constructing effective training instances from existing single-image grounding annotations and adding a rule-based spatial reward, yielding significant gains on complex multi-image and broader multimodal reasoning tasks.
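
To make the instance construction concrete, below is a minimal Python sketch of how compositional multi-image examples could be assembled from single-image grounding annotations. The `Grounding` record, the `inter_image_instance` and `intra_image_instance` helpers, the sampling rules, the question templates, and the answer schema are all illustrative assumptions; the paper's exact procedure is not detailed in this summary.

```python
# Minimal sketch of CGC-style compositional instance construction from
# single-image grounding annotations. All names, templates, and schemas
# here are illustrative assumptions, not the paper's actual pipeline.
import random
from dataclasses import dataclass

@dataclass
class Grounding:
    image_path: str
    phrase: str    # referring expression, e.g. "the red mug"
    box: tuple     # (x1, y1, x2, y2) in the source image

def inter_image_instance(target: Grounding, pool: list, k: int = 3) -> dict:
    """Inter-Image Contrast: mix the target image with semantically
    decoupled distractor images so the model must attribute the phrase
    to the correct source image before localizing it."""
    distractors = random.sample(
        [g for g in pool if g.phrase != target.phrase], k)
    images = [target.image_path] + [d.image_path for d in distractors]
    random.shuffle(images)
    return {
        "images": images,
        "question": f"In which image does '{target.phrase}' appear, and where?",
        "answer": {"image": images.index(target.image_path), "box": target.box},
    }

def intra_image_instance(views: list) -> dict:
    """Intra-Image Contrast: pair correlated cross-view samples (e.g. crops
    or augmented views) of the same annotated object to train object
    constancy; negatives could be formed analogously from distinct objects."""
    anchor, other = random.sample(views, 2)
    return {
        "images": [anchor.image_path, other.image_path],
        "question": f"Is '{anchor.phrase}' in image 0 the same object as in image 1?",
        "answer": True,  # both views come from one object by construction
    }
```

Both instance types reuse existing annotations rather than requiring new human labels or generated CoT data, which is what keeps the framework low-cost.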

Original Abstract

Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention leakage, and failures in object constancy. In addition, existing approaches typically rely on expensive human annotations or large-scale chain-of-thought (CoT) data generation. We propose Compositional Grounded Contrast (abbr. CGC), a low-cost full framework for boosting fine-grained multi-image understanding of MLLMs. Built on existing single-image grounding annotations, CGC constructs compositional multi-image training instances through Inter-Image Contrast and Intra-Image Contrast, which introduce semantically decoupled distractor contexts for cross-image discrimination and correlated cross-view samples for object constancy, respectively. CGC further introduces a Rule-Based Spatial Reward within the GRPO framework to improve source-image attribution, spatial alignment, and structured output validity under a Think-before-Grounding paradigm. Experiments show that CGC achieves state-of-the-art results on fine-grained multi-image benchmarks, including MIG-Bench and VLM2-Bench. The learned multi-image understanding capability also transfers to broader multimodal understanding and reasoning tasks, yielding consistent gains over the Qwen3-VL-8B base model on MathVista (+2.90), MuirBench (+2.88), MMStar (+1.93), MMMU (+1.77), and BLINK (+1.69).
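
Complementing the data side, the sketch below illustrates what a rule-based spatial reward for GRPO-style training might look like, combining the three signals the abstract names: structured-output validity, source-image attribution, and spatial alignment. The `<think>`/JSON output schema, the `spatial_reward` and `box_iou` helpers, and the 0.2/0.3/0.5 weights are assumptions for illustration, not the paper's actual reward.

```python
# Hedged sketch of a rule-based spatial reward; schema and weights are
# assumptions, not the paper's specification.
import json
import re

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def spatial_reward(response: str, gold_image_id: int, gold_box: tuple) -> float:
    """Score one rollout: format validity + source-image attribution + box IoU.

    Assumes the model emits `<think>...</think>` followed by a JSON grounding
    answer like {"image": 1, "box": [x1, y1, x2, y2]} (hypothetical schema).
    """
    # 1) Structured-output validity: the answer must parse.
    m = re.search(r"</think>\s*(\{.*\})", response, re.DOTALL)
    if m is None:
        return 0.0
    try:
        pred = json.loads(m.group(1))
        pred_box = tuple(float(v) for v in pred["box"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return 0.0
    r_format = 0.2
    # 2) Source-image attribution: grounding in the wrong image earns
    #    nothing beyond the format term.
    if pred.get("image") != gold_image_id:
        return r_format
    r_attr = 0.3
    # 3) Spatial alignment: IoU between predicted and gold boxes.
    r_iou = 0.5 * box_iou(pred_box, gold_box)
    return r_format + r_attr + r_iou
```

Within GRPO, a scalar like this would score each sampled rollout, with advantages then computed group-wise across rollouts for the same prompt.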
