ArXiv TLDR

Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts

arXiv:2604.08541

Haolei Xu, Haiwen Hong, Hongxing Li, Rui Zhou, Yang Zhang + 5 more

cs.CV · cs.AI · cs.CL

TLDR

This paper identifies "Seeing but Not Thinking" in multimodal MoE models, where visual inputs cause routing distraction, and proposes an intervention.

Key contributions

  • Identifies "Seeing but Not Thinking": MoE models perceive images but fail visual reasoning tasks.
  • Reveals that visual experts and domain experts are separated layer-wise, with image inputs inducing routing divergence in the middle layers where domain experts concentrate.
  • Proposes "Routing Distraction" hypothesis: routing fails to activate task-relevant reasoning experts.
  • Introduces a routing-guided intervention method that enhances domain expert activation, improving performance.

Why it matters

This paper uncovers a critical flaw in how multimodal MoE models handle visual reasoning, termed "Routing Distraction." By proposing and validating a routing-guided intervention, it offers a concrete path to better performance on complex vision-language reasoning tasks, and it helps explain why a model can perceive an image correctly yet still reason about it poorly.

Original Abstract

Multimodal Mixture-of-Experts (MoE) models have achieved remarkable performance on vision-language tasks. However, we identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as pure text. Through systematic analysis, we first verify that cross-modal semantic sharing exists in MoE architectures, ruling out semantic alignment failure as the sole explanation. We then reveal that visual experts and domain experts exhibit layer-wise separation, with image inputs inducing significant routing divergence from text inputs in middle layers where domain experts concentrate. Based on these findings, we propose the Routing Distraction hypothesis: when processing visual inputs, the routing mechanism fails to adequately activate task-relevant reasoning experts. To validate this hypothesis, we design a routing-guided intervention method that enhances domain expert activation. Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks. Our analysis further reveals that domain expert identification locates cognitive functions rather than sample-specific solutions, enabling effective transfer across tasks with different information structures.
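The intervention described above can be illustrated with a minimal sketch: given a set of pre-identified domain-expert indices, bias the router logits toward those experts before top-k selection. Everything here (the function name, the additive `alpha` boost, NumPy-based routing) is a hypothetical illustration of the general idea, not the paper's actual implementation.

```python
import numpy as np

def route_with_boost(logits, domain_experts, alpha=1.0, top_k=2):
    """Toy routing-guided intervention: boost domain experts before top-k routing.

    logits         : per-token router scores over experts, shape (num_experts,)
    domain_experts : indices of task-relevant reasoning experts (assumed known)
    alpha          : additive boost applied to domain-expert logits
    """
    boosted = logits.copy()
    boosted[domain_experts] += alpha          # enhance domain-expert activation
    top = np.argsort(boosted)[::-1][:top_k]   # select top-k experts
    # Softmax over the selected experts gives the mixture weights.
    w = np.exp(boosted[top] - boosted[top].max())
    w /= w.sum()
    return top, w

# Without the boost, expert 3 dominates; with alpha=2.0, a domain
# expert (index 0) enters the top-2 and receives routing weight.
top, w = route_with_boost(np.array([0.1, 0.5, 0.2, 0.9]), [0], alpha=2.0)
```

In a real MoE layer the boost would be applied to the router's pre-softmax logits per token, and the domain-expert indices would come from the paper's layer-wise expert identification; here they are simply given as input.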
