ArXiv TLDR

Majorization-Guided Test-Time Adaptation for Vision-Language Models under Modality-Specific Shift

arXiv: 2604.24602

Lixian Chen, Mingxuan Huang, Yanhui Chen, Junyi Lin, Yang Shi

cs.CV

TLDR

MG-MTTA improves test-time adaptation for vision-language models under asymmetric modality shifts by keeping the backbone frozen and updating only a lightweight reliability-aware gate.

Key contributions

  • Identifies that entropy-based test-time adaptation (TTA) can fail for VLMs under asymmetric modality shifts, sharpening the fused posterior while increasing error.
  • Proposes MG-MTTA, which keeps the backbone frozen and updates only a lightweight reliability-aware gate or adapter (see the sketch after this list).
  • Achieves substantial top-1 accuracy gains on ImageNet-based benchmarks, e.g., 57.97 → 66.51 under semantics-preserving textual shift.
  • Introduces a majorization view to analyze multimodal posterior failures and guide adaptation.
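
For intuition, here is a minimal sketch of reliability-gated fusion with a frozen backbone, as referenced in the MG-MTTA bullet above. It assumes each modality of a CLIP-style model yields its own class-logit vector; the gate parameterization (a single learnable scalar mixing weight, `ReliabilityGate`) is a hypothetical simplification for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReliabilityGate(nn.Module):
    """Hypothetical lightweight gate that mixes per-modality posteriors.

    Only this module is updated at test time; the VLM backbone that
    produces the per-modality logits stays frozen.
    """
    def __init__(self):
        super().__init__()
        # Unconstrained logit; sigmoid keeps the mixing weight in (0, 1).
        self.gate_logit = nn.Parameter(torch.zeros(1))

    def forward(self, visual_logits, textual_logits):
        w = torch.sigmoid(self.gate_logit)        # reliability weight of the visual branch
        p_v = F.softmax(visual_logits, dim=-1)    # visual-branch class posterior
        p_t = F.softmax(textual_logits, dim=-1)   # textual-branch class posterior
        return w * p_v + (1.0 - w) * p_t          # fused posterior; gradients flow only into the gate
```

Because only `gate_logit` receives gradients, an adaptation step costs little beyond the forward pass through the frozen backbone.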

Why it matters

This paper addresses a critical failure mode of vision-language models in real-world deployment, where the visual and textual branches can shift asymmetrically. MG-MTTA offers a robust remedy that substantially improves accuracy while keeping the backbone frozen. Its approach of controlling modality reliability, rather than just prediction entropy, points to a new direction for multimodal adaptation.

Original Abstract

Vision-language models transfer well in zero-shot settings, but at deployment the visual and textual branches often shift asymmetrically. Under this condition, entropy-based test-time adaptation can sharpen the fused posterior while increasing error, because an unreliable modality may still dominate fusion. We study this failure mode through a majorization view of multimodal posteriors and cast adaptation as a constrained de-mixing problem on the fused prediction. Based on this view, we propose MG-MTTA, which keeps the backbone frozen and updates only a lightweight gate or adapter. The objective combines fused-posterior entropy minimization with a reliability-aware gate prior built from anchor-based modality consistency and cross-modal conflict. Our analysis gives conditions under which entropy reduction preserves the correct ranking and a threshold that characterizes modality-dominance failure. On the ImageNet-based benchmark, MG-MTTA improves top-1 accuracy from 57.97 to 66.51 under semantics-preserving textual shift and from 21.68 to 26.27 under joint visual-textual shift, while remaining competitive in the visual-only benchmark. These results show that multimodal test-time adaptation should control modality reliability, not just prediction entropy.
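
To make the stated objective concrete, the following is a hedged sketch of the adaptation loss: fused-posterior entropy minimization plus a reliability-aware prior on the gate weight. The specific forms of `fused_entropy` and `gate_prior` (how anchor-based modality consistency and cross-modal conflict enter) are illustrative assumptions, not the paper's formulation.

```python
import torch

def fused_entropy(p_fused):
    """Shannon entropy of the fused posterior, averaged over the batch."""
    return -(p_fused * torch.log(p_fused.clamp_min(1e-8))).sum(dim=-1).mean()

def gate_prior(w, consistency_v, consistency_t, conflict, lam=1.0):
    """Hypothetical reliability-aware prior on the visual gate weight w.

    Pulls w toward the modality that agrees more with fixed anchors,
    with the pull scaled by the measured cross-modal conflict.
    """
    target = consistency_v / (consistency_v + consistency_t + 1e-8)
    return lam * (conflict * (w - target).pow(2)).mean()

# One test-time update (backbone frozen; only the gate's parameters in the optimizer):
# p_fused = gate(visual_logits, textual_logits)
# loss = fused_entropy(p_fused) + gate_prior(torch.sigmoid(gate.gate_logit),
#                                            consistency_v, consistency_t, conflict)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```

In this sketch, the prior keeps the entropy term from letting an unreliable modality dominate the fused prediction, which is the failure mode the paper analyzes.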
