ArXiv TLDR

ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety

2604.19083

Kun Wang, Cheng Qian, Miao Yu, Lilan Peng, Liang Lin + 4 more

cs.CR cs.AI

TLDR

ProjLens demystifies MLLM backdoors, revealing how projector fine-tuning introduces vulnerabilities and detailing their low-rank structure and activation mechanism.

Key contributions

  • Introduces ProjLens, an interpretability framework for MLLM backdoors.
  • Shows projector fine-tuning alone can introduce MLLM backdoor vulnerabilities.
  • Reveals backdoor-critical parameters are in a low-rank subspace of the projector.
  • Explains backdoor activation via semantic shifts scaled by input norm.
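The low-rank finding above can be illustrated with a small numerical sketch. This is a hypothetical toy setup (not the paper's code): we simulate a projector weight update dominated by a low-rank component plus small full-rank noise, then show that the raw update looks full-rank under SVD even though nearly all of its energy sits in a tiny subspace.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy dimensions for a projector weight update delta_w:
# a rank-r "backdoor" component buried in small full-rank noise.
d_out, d_in, r = 64, 128, 2
low_rank = rng.standard_normal((d_out, r)) @ rng.standard_normal((r, d_in))
noise = 0.01 * rng.standard_normal((d_out, d_in))
delta_w = low_rank + noise

# Numerical rank of the raw update: essentially full,
# matching the paper's observation that updates "appear overall full-rank".
s = np.linalg.svd(delta_w, compute_uv=False)
full_rank = int(np.sum(s > 1e-8))

# Fraction of squared singular-value energy captured by the
# top-r directions: nearly all of it, i.e. the critical signal
# is encoded in a low-rank subspace.
energy_top_r = (s[:r] ** 2).sum() / (s ** 2).sum()

print(full_rank)              # min(d_out, d_in) = 64
print(energy_top_r > 0.99)    # True
```

Checking subspace energy rather than numerical rank is what separates the two views: a full-rank matrix can still be effectively rank-2 in the sense that matters for the backdoor.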

Why it matters

MLLMs face critical safety threats from backdoors, yet the mechanisms behind these attacks remain opaque. ProjLens clarifies how such backdoors work, in particular how projector fine-tuning alone can introduce them. This understanding is vital for developing robust defenses and safer multimodal AI systems.

Original Abstract

Multimodal Large Language Models (MLLMs) have achieved remarkable success in cross-modal understanding and generation, yet their deployment is threatened by critical safety vulnerabilities. While prior works have demonstrated the feasibility of backdoors in MLLMs via fine-tuning data poisoning to manipulate inference, the underlying mechanisms of backdoor attacks remain opaque, complicating both understanding and mitigation. To bridge this gap, we propose ProjLens, an interpretability framework designed to demystify MLLM backdoors. We first establish that normal downstream task alignment, even when restricted to projector fine-tuning, introduces vulnerability to backdoor injection, whose activation mechanism differs from that observed in text-only LLMs. Through extensive experiments across four backdoor variants, we uncover: (1) Low-Rank Structure: backdoor injection updates appear full-rank overall and lack dedicated "trigger neurons", but the backdoor-critical parameters are encoded within a low-rank subspace of the projector; (2) Activation Mechanism: both clean and poisoned embeddings undergo a semantic shift toward a shared direction aligned with the backdoor target, but the shift magnitude scales linearly with the input norm, resulting in distinct backdoor activation on poisoned samples. Our code is available at: https://anonymous.4open.science/r/ProjLens-8FD7
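The activation mechanism in the abstract can be sketched numerically. This is a toy model under assumptions (the direction `v`, constant `alpha`, and norm inflation factor are all hypothetical): every embedding shifts along one shared direction toward the backdoor target, with magnitude proportional to the input norm, so only the norm-inflated (triggered) input shifts far enough to change behavior.

```python
import numpy as np

rng = np.random.default_rng(1)

dim = 32
v = rng.standard_normal(dim)
v /= np.linalg.norm(v)   # shared shift direction toward the backdoor target
alpha = 0.5              # assumed proportionality constant

def shifted(x):
    # Shift along v, with magnitude scaling linearly in ||x||.
    return x + alpha * np.linalg.norm(x) * v

clean = rng.standard_normal(dim)
poisoned = 10.0 * clean  # assume the trigger inflates the input norm 10x

clean_shift = np.linalg.norm(shifted(clean) - clean)
poison_shift = np.linalg.norm(shifted(poisoned) - poisoned)

# Both inputs shift in the same direction v, but the poisoned one
# shifts exactly 10x farther, mirroring the linear-in-norm scaling.
print(np.isclose(poison_shift / clean_shift, 10.0))  # True
```

The point of the sketch is that the trigger need not activate a dedicated neuron: a shared, norm-scaled drift suffices to separate poisoned from clean samples.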
