Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
Yanting Miao, Yutao Sun, Dexin Wang, Mengyu Zhou, Pascal Poupart + 6 more
TLDR
GAP is a granular alignment paradigm that stabilizes visual latent reasoning in MLLMs by fixing the feature-space mismatch between reused decoder hidden states and the input embeddings the model was trained to consume, improving perception and reasoning performance.
Key contributions
- Identifies and addresses a feature-space mismatch in MLLMs causing unstable visual latent reasoning.
- Introduces GAP with feature-level alignment using a PCA-aligned latent head for input-compatible latents (see the sketch after this list).
- Incorporates context-level alignment with auxiliary visual supervision and capacity-guided selective supervision.
- Achieves the best mean aggregate perception and reasoning performance among the paper's supervised variants on Qwen2.5-VL 7B.
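The digest does not include the authors' implementation, but a minimal sketch of what a "PCA-aligned latent head" could look like is given below: a small trainable projection into the top principal directions of the input embedding table, rescaled to the embeddings' norm regime. The class name, the choice of k, and the exact PCA target (here, the token embedding table) are illustrative assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn


class PCAAlignedLatentHead(nn.Module):
    """Maps decoder hidden states into the input-embedding space: project onto
    the top-k principal directions of the (frozen) input embedding table, then
    rescale to the embeddings' typical norm, so the predicted latent resembles
    an input the MLLM was trained to consume. Hypothetical sketch, not the
    authors' code."""

    def __init__(self, embed_matrix: torch.Tensor, k: int = 256):
        super().__init__()
        d = embed_matrix.shape[1]
        # PCA of the input embedding table (assumed alignment target).
        centered = embed_matrix - embed_matrix.mean(dim=0, keepdim=True)
        _, _, v = torch.pca_lowrank(centered, q=k)            # v: (d, k)
        self.register_buffer("basis", v)
        self.register_buffer("mean", embed_matrix.mean(dim=0))
        self.register_buffer("target_norm", embed_matrix.norm(dim=-1).mean())
        # Lightweight trainable map from hidden states to PCA coordinates.
        self.proj = nn.Linear(d, k)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d) decoder hidden states from the pre-norm MLLM.
        coords = self.proj(hidden)                            # (batch, seq, k)
        latent = coords @ self.basis.T + self.mean            # back to embedding space
        # Match the norm regime of real input embeddings before feeding back.
        return latent * (self.target_norm / latent.norm(dim=-1, keepdim=True))
```

A usage pattern consistent with the abstract would be `latent = head(decoder_hidden_states)`, with `latent` then fed back as the next input token instead of the raw hidden state; whether the authors freeze or further train the PCA basis is not stated in this digest.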
Why it matters
This paper tackles an instability in visual latent reasoning within MLLMs, a key capability for advanced multimodal AI. By proposing GAP, it offers a principled way to align predicted visual latents with the representations the model was trained to consume, improving perception and reasoning performance, and it points toward more reliable latent-reasoning MLLMs.
Original Abstract
Visual latent reasoning lets a multimodal large language model (MLLM) create intermediate visual evidence as continuous tokens, avoiding external tools or image generators. However, existing methods usually follow an output-as-input latent paradigm and yield unstable gains. We identify evidence for a feature-space mismatch that can contribute to this instability: dominant visual-latent models build on pre-norm MLLMs and reuse decoder hidden states as predicted latent inputs, even though these states occupy a substantially different norm regime from the input embeddings the model was trained to consume [xie2025mhc, li2026siamesenorm, team2026attention]. This mismatch can make direct latent feedback unreliable. Motivated by this diagnosis, we propose GAP, a Granular Alignment Paradigm for visual latent modeling. GAP aligns visual latent reasoning at three levels: feature-level alignment maps decoder outputs into input-compatible visual latents through a lightweight PCA-aligned latent head; context-level alignment grounds latent targets with inspectable auxiliary visual supervision; and capacity-guided alignment assigns latent supervision selectively to examples where the base MLLM struggles. On Qwen2.5-VL 7B, the resulting model achieves the best mean aggregate perception and reasoning performance among our supervised variants. Inference-time intervention probing further suggests that generated latents provide task-relevant visual signal beyond merely adding token slots.
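To make the norm-regime claim concrete, here is a minimal sketch of the kind of check it implies, run on a small pre-norm text-only LM (GPT-2) via Hugging Face transformers rather than on Qwen2.5-VL; the model, probe sentence, and choice of the final hidden state are illustrative assumptions, not the paper's diagnostic setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small pre-norm LM stands in for the pre-norm MLLM discussed in the abstract.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("a small probe sentence", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

input_embeds = model.get_input_embeddings()(ids)  # what the model consumes as input
last_hidden = out.hidden_states[-1]                # what output-as-input methods feed back

print("mean ||input embedding||:     ", input_embeds.norm(dim=-1).mean().item())
print("mean ||decoder hidden state||:", last_hidden.norm(dim=-1).mean().item())
# In pre-norm transformers the two quantities typically sit in very different
# norm regimes; GAP's feature-level alignment maps the reused hidden states
# back into the input regime before they are consumed again.
```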