ArXiv TLDR

Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts

arXiv:2604.08541

Haolei Xu, Haiwen Hong, Hongxing Li, Rui Zhou, Yang Zhang + 5 more

cs.CV · cs.AI · cs.CL

TLDR

This paper identifies "Seeing but Not Thinking" in multimodal MoE models, where visual inputs cause routing distraction, and proposes an intervention.

Key contributions

  • Identifies "Seeing but Not Thinking": MoE models perceive images but fail visual reasoning tasks.
  • Reveals that visual experts and domain experts are separated layer-wise, with image inputs inducing routing divergence in the middle layers where domain experts concentrate.
  • Proposes "Routing Distraction" hypothesis: routing fails to activate task-relevant reasoning experts.
  • Introduces a routing-guided intervention method that enhances domain expert activation, improving performance.

Why it matters

This paper uncovers a critical flaw in how multimodal MoE models handle visual reasoning, termed "Routing Distraction." By proposing and validating a routing-guided intervention, it offers a concrete path to better performance on complex vision-language reasoning tasks, and it helps explain why a model can perceive an image correctly yet still reason about it poorly.

Original Abstract

Multimodal Mixture-of-Experts (MoE) models have achieved remarkable performance on vision-language tasks. However, we identify a puzzling phenomenon termed Seeing but Not Thinking: models accurately perceive image content yet fail in subsequent reasoning, while correctly solving identical problems presented as pure text. Through systematic analysis, we first verify that cross-modal semantic sharing exists in MoE architectures, ruling out semantic alignment failure as the sole explanation. We then reveal that visual experts and domain experts exhibit layer-wise separation, with image inputs inducing significant routing divergence from text inputs in middle layers where domain experts concentrate. Based on these findings, we propose the Routing Distraction hypothesis: when processing visual inputs, the routing mechanism fails to adequately activate task-relevant reasoning experts. To validate this hypothesis, we design a routing-guided intervention method that enhances domain expert activation. Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks. Our analysis further reveals that domain expert identification locates cognitive functions rather than sample-specific solutions, enabling effective transfer across tasks with different information structures.
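The intervention described above can be illustrated with a minimal sketch: given a set of pre-identified domain-expert indices, bias the router logits toward those experts before top-k selection. Everything here (the function name, the additive `alpha` boost, NumPy-based routing) is a hypothetical illustration of the general idea, not the paper's actual implementation.

```python
import numpy as np

def route_with_boost(logits, domain_experts, alpha=1.0, top_k=2):
    """Toy routing-guided intervention: boost domain experts before top-k routing.

    logits         : per-token router scores over experts, shape (num_experts,)
    domain_experts : indices of task-relevant reasoning experts (assumed known)
    alpha          : additive boost applied to domain-expert logits
    """
    boosted = logits.copy()
    boosted[domain_experts] += alpha          # enhance domain-expert activation
    top = np.argsort(boosted)[::-1][:top_k]   # select top-k experts
    # Softmax over the selected experts gives the mixture weights.
    w = np.exp(boosted[top] - boosted[top].max())
    w /= w.sum()
    return top, w

# Without the boost, expert 3 dominates; with alpha=2.0, a domain
# expert (index 0) enters the top-2 and receives routing weight.
top, w = route_with_boost(np.array([0.1, 0.5, 0.2, 0.9]), [0], alpha=2.0)
```

In a real MoE layer the boost would be applied to the router's pre-softmax logits per token, and the domain-expert indices would come from the paper's layer-wise expert identification; here they are simply given as input.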
