ArXiv TLDR

Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models

arXiv: 2604.03117

Chengyin Hu, Yuxian Dong, Yikun Guo, Xiang Chen, Junqi Wu + 4 more

cs.CV

TLDR

A new universal adversarial patch framework, UCGP (Universal Curved-Grid Patch), exposes critical semantic vulnerabilities in infrared vision-language models, compromising perception in low-visibility conditions.

Key contributions

  • Introduces UCGP, a universal physical adversarial patch framework for IR-VLMs.
  • Uses Curved-Grid Mesh (CGM) for continuous, low-frequency, and deployable patch generation.
  • Disrupts visual representation space to weaken cross-modal semantic alignment in IR-VLMs.
  • Achieves cross-model transferability, real-world effectiveness, and robustness against defenses.
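The paper does not spell out the Curved-Grid Mesh construction in this summary, but the core idea it names — a patch controlled by a coarse grid of points so the resulting pattern is continuous and low-frequency (hence printable and deployable) — can be illustrated with a hypothetical bilinear-upsampling sketch. The function name and grid sizes below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def curved_grid_patch(control, out_h, out_w):
    """Upsample a coarse control-point grid to a full-resolution patch
    via bilinear interpolation, producing a smooth low-frequency pattern.
    A grid-parameterized patch like this is a rough stand-in for the
    paper's Curved-Grid Mesh idea, not its actual method."""
    gh, gw = control.shape
    ys = np.linspace(0, gh - 1, out_h)
    xs = np.linspace(0, gw - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, gh - 1)
    x1 = np.minimum(x0 + 1, gw - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = control[y0][:, x0] * (1 - wx) + control[y0][:, x1] * wx
    bot = control[y1][:, x0] * (1 - wx) + control[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

# A 4x4 control grid yields a 64x64 smooth grayscale patch in [0, 1];
# optimizing only the 16 control values keeps the pattern low-frequency.
rng = np.random.default_rng(0)
ctrl = rng.uniform(0.0, 1.0, size=(4, 4))
patch = curved_grid_patch(ctrl, 64, 64)
```

Because every output pixel is a convex combination of nearby control values, the patch stays within the control values' range and contains no high-frequency detail that would be lost when printed and re-imaged by an infrared camera.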

Why it matters

This paper reveals a critical, previously overlooked vulnerability in infrared vision-language models, which are vital for low-visibility perception. By demonstrating how universal adversarial patches can compromise their semantic understanding, it highlights urgent security concerns for multimodal systems deployed in challenging environments.

Original Abstract

Infrared vision-language models (IR-VLMs) have emerged as a promising paradigm for multimodal perception in low-visibility environments, yet their robustness to adversarial attacks remains largely unexplored. Existing adversarial patch methods are mainly designed for RGB-based models in closed-set settings and are not readily applicable to the open-ended semantic understanding and physical deployment requirements of infrared VLMs. To bridge this gap, we propose Universal Curved-Grid Patch (UCGP), a universal physical adversarial patch framework for IR-VLMs. UCGP integrates Curved-Grid Mesh (CGM) parameterization for continuous, low-frequency, and deployable patch generation with a unified representation-driven objective that promotes subspace departure, topology disruption, and stealth. To improve robustness under real-world deployment and domain shift, we further incorporate Meta Differential Evolution and EOT-augmented TPS deformation modeling. Rather than manipulating labels or prompts, UCGP directly disrupts the visual representation space, weakening cross-modal semantic alignment. Extensive experiments demonstrate that UCGP consistently compromises semantic understanding across diverse IR-VLM architectures while maintaining cross-model transferability, cross-dataset generalization, real-world physical effectiveness, and robustness against defenses. These findings reveal a previously overlooked robustness vulnerability in current infrared multimodal systems.
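The abstract's "EOT-augmented" deformation modeling refers to Expectation over Transformations: instead of scoring a patch at one fixed placement, the attack loss is averaged over randomly sampled physical transformations so the patch stays effective under real-world variation. A minimal sketch of that averaging structure is below; the surrogate loss is a placeholder (a real attack would query the target IR-VLM's visual encoder), and the random-placement transform stands in for the paper's TPS deformations:

```python
import numpy as np

rng = np.random.default_rng(0)

def paste_patch(image, patch, top, left):
    """Overlay the patch onto a copy of the image at (top, left)."""
    out = image.copy()
    h, w = patch.shape
    out[top:top + h, left:left + w] = patch
    return out

def surrogate_loss(image):
    """Placeholder loss: a real attack would evaluate the model's
    cross-modal alignment on this image (hypothetical stand-in)."""
    return float(image.mean())

def eot_loss(image, patch, n_samples=16):
    """Expectation over Transformations: average the loss over randomly
    sampled patch placements, approximating robustness to the physical
    variation a deployed patch encounters."""
    H, W = image.shape
    h, w = patch.shape
    total = 0.0
    for _ in range(n_samples):
        top = int(rng.integers(0, H - h + 1))
        left = int(rng.integers(0, W - w + 1))
        total += surrogate_loss(paste_patch(image, patch, top, left))
    return total / n_samples

img = np.zeros((32, 32))   # toy 32x32 infrared frame
p = np.ones((8, 8))        # toy 8x8 patch
loss = eot_loss(img, p)
```

Optimizing the patch against this averaged objective (the paper uses Meta Differential Evolution, a gradient-free search) favors patterns that work across many placements and deformations rather than one lucky configuration.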
