ArXiv TLDR

Information Router for Mitigating Modality Dominance in Vision-Language Models

arXiv:2604.16264

Seulgi Kim, Mohit Prabhushankar, Ghassan AlRegib

cs.CV, cs.LG

TLDR

MoIR introduces an information router for VLMs that mitigates modality dominance by explicitly enriching less informative tokens with complementary information from stronger modalities.

Key contributions

  • Proposes MoIR, a novel information router for VLMs to reduce modality dominance.
  • Identifies less informative tokens and routes complementary data from stronger modalities.
  • Constructs information-dense token representations prior to large language model processing.
  • Achieves more balanced modality contribution, improving robustness and downstream performance.
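The routing idea in the contributions above can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: token informativeness is approximated here by embedding norm, and the `threshold` and `alpha` hyperparameters, the cosine-similarity matching, and all function names are assumptions for the sake of the example.

```python
def l2_norm(v):
    """Euclidean norm of a token embedding (a list of floats)."""
    return sum(x * x for x in v) ** 0.5

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def route_information(weak_tokens, strong_tokens, threshold=1.0, alpha=0.5):
    """Enrich low-information tokens in `weak_tokens` with content routed
    from `strong_tokens` before they reach the language model.
    `threshold` (informativeness cutoff) and `alpha` (blend weight) are
    illustrative hyperparameters, not values from the paper."""
    enriched = []
    for tok in weak_tokens:
        if l2_norm(tok) >= threshold:
            # Token is informative enough: pass it through unchanged.
            enriched.append(tok)
            continue
        # Route: find the most similar strong-modality token (cosine similarity)
        # and blend it into the weak token to raise its information content.
        best = max(
            strong_tokens,
            key=lambda s: dot(tok, s) / (l2_norm(tok) * l2_norm(s) + 1e-8),
        )
        enriched.append([(1 - alpha) * t + alpha * s for t, s in zip(tok, best)])
    return enriched
```

In a real VLM this selection would operate on learned representations inside the fusion stage; the sketch only shows the control flow of identify-then-route.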

Why it matters

Modality dominance is a critical issue in VLMs that undermines reliable performance. MoIR takes a new approach by directly addressing information disparity rather than only reallocating attention. This yields more robust and balanced multi-modal reasoning, especially when one modality is degraded.

Original Abstract

Vision-Language models (VLMs) have demonstrated strong performance across a wide range of benchmarks, yet they often suffer from modality dominance, where predictions rely disproportionately on a single modality. Prior approaches primarily address this issue by steering the model's attention allocation, implicitly assuming that all modalities provide sufficient information. However, attention only determines where the model focuses; it cannot enrich information that is missing or ambiguous. In the real world, input modalities often differ in information density and signal-to-noise ratio. In such cases, simply adjusting the model's attention does not resolve the underlying lack of information. In this paper, we propose MoIR: Multi-modal Information Router, an information-level fusion method that explicitly reduces information disparity prior to fusion. MoIR identifies less informative tokens and routes complementary information from a stronger modality, constructing information-dense token representations before they are processed by a large language model. By modifying information availability, MoIR enables reliable shifts in modality dominance, even when one modality is degraded. We evaluate MoIR on three widely used multi-modal benchmarks across multiple model backbones. Experimental results show that MoIR consistently achieves more balanced modality contribution and improves robustness and downstream performance, particularly under modality degradation. These findings demonstrate that explicitly modifying cross-modal information is an effective and complementary strategy for mitigating modality dominance in multi-modal reasoning models.
