ArXiv TLDR

Scaling Exposes the Trigger: Input-Level Backdoor Detection in Text-to-Image Diffusion Models via Cross-Attention Scaling

2604.12446

Zida Li, Jun Li, Yuzhe Sha, Ziqiang Li, Lizhi Xiong + 1 more

cs.CR cs.CV

TLDR

SET detects backdoor inputs to T2I diffusion models by scaling cross-attention and measuring the resulting response divergence between benign and malicious prompts, outperforming prior input-level defenses.

Key contributions

  • Introduces Cross-Attention Scaling Response Divergence (CSRD) for backdoor detection in T2I models.
  • Proposes SET, an input-level framework using multi-scale cross-attention perturbations.
  • Learns a benign response space from clean samples, detecting deviations without attack knowledge.
  • Outperforms baselines overall, improving AUROC by 9.1% and ACC by 6.5% over the best baseline, with the strongest gains on stealthy implicit-trigger attacks.

Why it matters

This paper addresses a critical security gap in text-to-image diffusion models, where stealthy, semantics-preserving backdoor triggers evade existing defenses. SET offers a robust, input-level solution that requires neither prior knowledge of the attack nor access to the model's training. Its practical effectiveness makes T2I models safer to deploy.

Original Abstract

Text-to-image (T2I) diffusion models have achieved remarkable success in image synthesis, but their reliance on large-scale data and open ecosystems introduces serious backdoor security risks. Existing defenses, particularly input-level methods, are more practical for deployment but often rely on observable anomalies that become unreliable under stealthy, semantics-preserving trigger designs. As modern backdoor attacks increasingly embed triggers into natural inputs, these methods degrade substantially, raising a critical question: can more stable, implicit, and trigger-agnostic differences between benign and backdoor inputs be exploited for detection? In this work, we address this challenge from an active probing perspective. We introduce controlled scaling perturbations on cross-attention and uncover a novel phenomenon termed Cross-Attention Scaling Response Divergence (CSRD), where benign and backdoor inputs exhibit systematically different response evolution patterns across denoising steps. Building on this insight, we propose SET, an input-level backdoor detection framework that constructs response-offset features under multi-scale perturbations and learns a compact benign response space from a small set of clean samples. Detection is then performed by measuring deviations from this learned space, without requiring prior knowledge of the attack or access to model training. Extensive experiments demonstrate that SET consistently outperforms existing baselines across diverse attack methods, trigger types, and model settings, with particularly strong gains under stealthy implicit-trigger scenarios. Overall, SET improves AUROC by 9.1% and ACC by 6.5% over the best baseline, highlighting its effectiveness and robustness for practical deployment.
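The detection pipeline described in the abstract — perturb cross-attention at several scales, collect response offsets across denoising steps, fit a compact benign model from a few clean samples, and score deviations — can be sketched as follows. This is a minimal illustration on synthetic response vectors, not the paper's implementation: the scale values, the Gaussian benign model, and the Mahalanobis score are all assumptions standing in for whatever SET actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)
SCALES = (0.5, 1.5, 2.0)  # hypothetical cross-attention scaling factors
T = 10                    # number of denoising steps probed

def response_offset_features(responses_by_scale, baseline):
    # Offset of each scaled response trajectory from the unscaled baseline,
    # concatenated into one feature vector of length len(SCALES) * T.
    return np.concatenate([responses_by_scale[s] - baseline for s in SCALES])

def fit_benign_space(X):
    # Compact benign model from clean samples: mean plus a
    # regularized inverse covariance (assumed stand-in for SET's learned space).
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-3 * np.eye(X.shape[1])
    return mu, np.linalg.inv(cov)

def anomaly_score(x, mu, cov_inv):
    # Mahalanobis distance from the benign response space; large = suspicious.
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Toy simulation: benign offsets are small and consistent across scales,
# while a backdoor input's response diverges systematically (the CSRD idea).
benign = rng.normal(0.0, 0.1, size=(50, len(SCALES) * T))
mu, cov_inv = fit_benign_space(benign)
benign_score = anomaly_score(benign[0], mu, cov_inv)
suspect_score = anomaly_score(benign[0] + 2.0, mu, cov_inv)
```

In this toy setting the systematically shifted input scores far outside the benign space, which is the detection signal: no trigger pattern is modeled, only deviation from clean-sample behavior under scaling perturbations.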
