ArXiv TLDR

STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack

2605.00699

Xutao Mao, Liangjie Zhao, Tao Liu, Xiang Zheng, Hongying Zan + 1 more

cs.CR

TLDR

STARE is an RL framework that red-teams VLMs by attacking the image generation trajectory, revealing temporal toxicity vulnerabilities and improving attack success.

Key contributions

  • Introduces STARE, a hierarchical RL framework for white-box T2I and black-box VLM toxicity attacks.
  • Achieves a 68% improvement in Attack Success Rate over state-of-the-art baselines by targeting the image denoising trajectory itself.
  • Discovers "Optimization-Induced Phase Alignment," where toxicity emerges in predictable temporal phases.
  • Shows that perturbing specific phases can selectively suppress different toxicity categories.

Why it matters

This paper transforms VLM toxicity analysis from a black-box problem into a predictable, phase-aware process. By identifying specific temporal windows where harms emerge, it offers both a powerful attack engine and a foundation for developing targeted, phase-aware safety mechanisms. This is crucial for building more robust and secure multi-modal AI.

Original Abstract

Red-teaming Vision-Language Models is essential for identifying vulnerabilities where adversarial image-text inputs trigger toxic outputs. Existing approaches treat image generation as a black box, returning only terminal toxicity scores and leaving open the question of when and how toxic semantics emerge during multi-step synthesis. We introduce STARE, a hierarchical reinforcement learning framework that treats the denoising trajectory itself as the attack surface, under a direct white-box T2I and query-only black-box VLM setting. By coupling a high-level prompt editor with low-level T2I fine-tuning via Group Relative Policy Optimization (GRPO), STARE attains a 68% improvement in Attack Success Rate over state-of-the-art black-box and white-box baselines. More importantly, this trajectory-level view surfaces the Optimization-Induced Phase Alignment phenomenon: vanilla models exhibit diffuse toxicity, whereas adversarial optimization concentrates conceptual harms into early semantic phases and detail-oriented harms into late refinement. Targeted perturbations of either window selectively suppress different toxicity categories, indicating that this temporal structure is a genuine causal handle rather than a side effect of the hierarchical design. The phenomenon turns toxicity formation from a chaotic process into a small set of predictable vulnerability windows, providing both a potent attack engine and a basis for phase-aware safety mechanisms. Content warning: This paper contains examples of toxic content that may be offensive or disturbing.
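The abstract mentions that STARE fine-tunes via Group Relative Policy Optimization (GRPO). The core idea of GRPO is to score each rollout in a sampled group against the group's own reward statistics instead of a learned value critic. Below is a minimal sketch of that group-relative advantage step; the function name, variable names, and reward values are illustrative assumptions, not taken from the paper.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its group's statistics.

    GRPO samples a group of rollouts per prompt and scores each one
    relative to the group mean and standard deviation, which removes
    the need for a separate learned value function.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical toxicity scores for a group of four attack rollouts:
# above-mean rollouts get positive advantages, below-mean get negative.
advantages = group_relative_advantages([0.1, 0.4, 0.7, 0.8])
```

These advantages would then weight a clipped policy-gradient objective (as in PPO) when updating the prompt editor and the T2I model.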
