ArXiv TLDR

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

arXiv: 2604.11309

Yihao Zhang, Kai Wang, Jiangrong Wu, Haolin Wu, Yuxuan Zhou, and 5 more authors

cs.CR, cs.AI, cs.CL, cs.CV, cs.LG

TLDR

This paper identifies Salami Slicing Risk, a multi-turn jailbreak threat in which individually low-risk inputs cumulatively bypass LLM safety alignment, and builds the Salami Attack on it, achieving high attack success rates.

Key contributions

  • Introduces "Salami Slicing Risk," a multi-turn jailbreak threat that chains individually low-risk inputs whose harmful intent accumulates across turns.
  • Develops the "Salami Attack," an automatic, universal framework effective across diverse LLMs and modalities.
  • Achieves over 90% attack success rate on GPT-4o and Gemini and remains robust against real-world alignment defenses.
  • Proposes a defense strategy that constrains the Salami Attack by at least 44.8% and blocks up to 64.8% of other multi-turn jailbreak attacks.

Why it matters

This paper reveals a critical, covert multi-turn jailbreaking threat to LLMs, where seemingly harmless inputs can cumulatively lead to high-risk behaviors. It offers both a potent attack method and a practical defense. The findings are crucial for enhancing LLM security and developing more robust alignment mechanisms.

Original Abstract

Large Language Models (LLMs) face prominent security risks from jailbreaking, a practice that manipulates models to bypass built-in security constraints and generate unethical or unsafe content. Among various jailbreak techniques, multi-turn jailbreak attacks are more covert and persistent than single-turn counterparts, exposing critical vulnerabilities of LLMs. However, existing multi-turn jailbreak methods suffer from two fundamental limitations that affect their actual impact in real-world scenarios: (a) as models become more context-aware, any explicit harmful trigger is increasingly likely to be flagged and blocked; (b) successful final-step triggers often require finely tuned, model-specific contexts, making such attacks highly context-dependent. To fill this gap, we propose Salami Slicing Risk, which operates by chaining numerous low-risk inputs that individually evade alignment thresholds but collectively accumulate harmful intent to ultimately trigger high-risk behaviors, without heavy reliance on pre-designed contextual structures. Building on this risk, we develop the Salami Attack, an automatic framework universally applicable to multiple model types and modalities. Rigorous experiments demonstrate its state-of-the-art performance across diverse models and modalities, achieving over 90% Attack Success Rate on GPT-4o and Gemini, as well as robustness against real-world alignment defenses. We also propose a defense strategy that constrains the Salami Attack by at least 44.8% while achieving a maximum blocking rate of 64.8% against other multi-turn jailbreak attacks. Our findings provide critical insights into the pervasive risks of multi-turn jailbreaking and offer actionable mitigation strategies to enhance LLM security.
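
To make the cumulative-risk mechanism concrete, here is a minimal, hypothetical Python sketch (not the paper's implementation). It assumes made-up per-turn risk scores from an unspecified moderation classifier and contrasts a per-turn filter, which the salami-sliced turns all evade, with a dialogue-level budget check of the kind a cumulative defense would need to enforce.

    # Toy illustration of the "salami slicing" idea: every turn scores below a
    # per-turn moderation threshold, yet the accumulated risk of the dialogue
    # eventually crosses a cumulative budget. The scores below are invented;
    # a real system would obtain them from a moderation/alignment classifier.

    PER_TURN_THRESHOLD = 0.5   # single-turn filters only block turns above this
    CUMULATIVE_BUDGET = 1.5    # hypothetical dialogue-level budget for a defense

    # Hypothetical per-turn risk scores for one multi-turn conversation.
    turn_risk_scores = [0.15, 0.25, 0.30, 0.35, 0.40, 0.45]

    def single_turn_filter(scores, threshold=PER_TURN_THRESHOLD):
        """Mimics a per-turn filter: flags only turns that are individually risky."""
        return [s > threshold for s in scores]

    def cumulative_filter(scores, budget=CUMULATIVE_BUDGET):
        """Mimics a dialogue-level defense: returns the index of the turn at which
        accumulated risk first exceeds the budget, or None if it never does."""
        total = 0.0
        for i, s in enumerate(scores):
            total += s
            if total > budget:
                return i
        return None

    if __name__ == "__main__":
        print("per-turn flags:", single_turn_filter(turn_risk_scores))   # all False
        print("cumulative stop at turn index:", cumulative_filter(turn_risk_scores))  # 5

In this toy run no individual turn is flagged, but the conversation is stopped at the sixth turn once the running total passes the budget. The paper's actual defense and its 44.8%/64.8% figures come from the authors' own method, which this sketch does not reproduce.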
