ArXiv TLDR

Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

arXiv:2605.02735

Xin Zhang, Qiqi Tao, Jiawei Du, Moyun Liu, Joey Tianyi Zhou

cs.LG

TLDR

This paper introduces a method to 'unsilence' visual latents in MLLMs by optimizing latent reasoning at inference time, strengthening their contribution to final-answer prediction without any parameter updates.

Key contributions

  • Identifies "Silenced Visual Latents," where MLLM latents are semantically rich but suppressed by direct visual input shortcuts.
  • Proposes inference-time latent optimization to unleash suppressed reasoning without updating backbone parameters.
  • Uses query-guided contrastive alignment (Stage I) to warm up latents and prevent latent collapse.
  • Employs a confidence-progression reward (Stage II) to route predictions through latent reasoning.
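The two stages above can be illustrated with a toy sketch. This is not the authors' implementation; the function names, the InfoNCE-style form of the Stage I contrastive loss, and the entropy-drop form of the Stage II reward are assumptions based on the summary's description (align latents with query-relevant visual features; reward progressively more concentrated predictions along the latent span).

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def stage1_contrastive_loss(sim_row, pos_idx, tau=0.1):
    """Toy Stage I loss for one latent token: given its similarities to
    all visual features, pull it toward the query-relevant (positive)
    feature and push it from the rest (InfoNCE-style, assumed form)."""
    probs = softmax([s / tau for s in sim_row])
    return -math.log(probs[pos_idx])

def stage2_confidence_progression_reward(entropies):
    """Toy Stage II reward: sum of entropy drops of the predicted answer
    distribution along the latent span, so the reward is high only when
    predictions grow progressively more concentrated through the latents."""
    return sum(max(0.0, entropies[t] - entropies[t + 1])
               for t in range(len(entropies) - 1))
```

In this sketch, a latent well aligned with its positive visual feature incurs a low Stage I loss, and a latent span whose answer entropy falls step by step earns a positive Stage II reward, while a flat entropy profile (the "bypassed" latents) earns zero.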

Why it matters

This paper uncovers and resolves a key issue in MLLMs: the suppression of visual latent reasoning. Its inference-time optimization method significantly boosts performance by leveraging existing latent knowledge more effectively, without costly retraining.

Original Abstract

Continuous latent-space reasoning offers a compact alternative to textual chain-of-thought for multimodal models, enabling high-dimensional visual evidence to be integrated without explicit reasoning tokens. However, we identify a previously overlooked optimization pathology in existing latent visual reasoning methods: although visual latents become semantically enriched during training, their contribution to final answer prediction is systematically suppressed. Within the shared parameter space, the autoregressive objective favors shortcut reliance on direct visual input, driving latent tokens toward transition-like states rather than informative reasoning content. We term this phenomenon Silenced Visual Latents. To address it, we disentangle the two conflicting objectives by directly optimizing the latent reasoning at inference time, keeping backbone parameters frozen. In Stage I, visual latents are warmed up via query-guided contrastive latent–visual alignment, improving semantic quality while preventing latent collapse. In Stage II, the latent reasoning is further optimized via a confidence-progression reward, which incentivizes predicted token distributions along the latent span to become progressively more concentrated, routing predictions through the latent reasoning rather than bypassing it. Experiments across eight benchmarks and four model backbones show that inference-time latent optimization, without any parameter updates, effectively unleashes the suppressed reasoning capacity of visual latents.
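The key structural idea in the abstract is that only the latent tokens are optimized at inference time while every backbone parameter stays frozen. The minimal sketch below illustrates that separation with a finite-difference gradient-ascent loop on a generic reward; the optimizer, step sizes, and the reward function are all hypothetical stand-ins, not the paper's actual procedure.

```python
def optimize_latents(latents, reward_fn, lr=0.1, steps=50, eps=1e-3):
    """Toy inference-time latent optimization: ascend a reward by updating
    only the latent vector `z`. The model itself appears solely through
    `reward_fn`, so its parameters are never touched ('frozen backbone').
    Gradients are estimated by central finite differences for simplicity."""
    z = list(latents)
    for _ in range(steps):
        grad = []
        for i in range(len(z)):
            z_plus = list(z); z_plus[i] += eps
            z_minus = list(z); z_minus[i] -= eps
            grad.append((reward_fn(z_plus) - reward_fn(z_minus)) / (2 * eps))
        z = [zi + lr * g for zi, g in zip(z, grad)]
    return z
```

For example, with a quadratic surrogate reward peaked at a target latent, the loop converges to that target while `reward_fn` (the frozen model) is only ever queried, never modified. In the paper's setting, the reward would be the Stage I alignment and Stage II confidence-progression objectives computed through the frozen MLLM.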
