ArXiv TLDR

Involuntary In-Context Learning: Exploiting Few-Shot Pattern Completion to Bypass Safety Alignment in GPT-5.4

arXiv:2604.19461

Alex Polyakov, Daniel Kuznetsov

cs.CR

TLDR

This paper introduces Involuntary In-Context Learning (IICL), an attack that exploits few-shot pattern completion to bypass safety alignment in GPT-5.4.

Key contributions

  • Introduces Involuntary In-Context Learning (IICL) to bypass LLM safety using few-shot patterns.
  • In the ablation study, semantic operator naming achieves a 100% bypass rate (50/50 probes).
  • Abstract framing is required: the same examples in direct question-and-answer format yield a 0% bypass rate.
  • On HarmBench, IICL achieves a 24.0% bypass rate against GPT-5.4, compared to 0.0% for direct queries.

Why it matters

This research reveals a vulnerability in LLM safety alignment: sufficiently strong in-context patterns can override trained refusal behaviors. It highlights the need for safety mechanisms that go beyond current behavioral training, and understanding IICL is a step toward building them.

Original Abstract

Safety alignment in large language models relies on behavioral training that can be overridden when sufficiently strong in-context patterns compete with learned refusal behaviors. We introduce Involuntary In-Context Learning (IICL), an attack class that uses abstract operator framing with few-shot examples to force pattern completion that overrides safety training. Through 3479 probes across 10 OpenAI models, we identify the attack's effective components through a seven-experiment ablation study. Key findings: (1) semantic operator naming achieves 100% bypass rate (50/50, p < 0.001); (2) the attack requires abstract framing, since identical examples in direct question-and-answer format yield 0%; (3) example ordering matters strongly (interleaved: 76%, harmful-first: 6%); (4) temperature has no meaningful effect (46–56% across 0.0–1.0). On the HarmBench benchmark, IICL achieves 24.0% bypass [18.6%, 30.4%] against GPT-5.4 with detailed 619-word responses, compared to 0.0% for direct queries.
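
For readers who want to sanity-check how a bracketed range can accompany a bypass rate, here is a minimal sketch of a Wilson score interval for a binomial proportion. It is an illustration, not the authors' procedure: the abstract does not say which interval method or HarmBench sample size was used. The function name `wilson_interval` and the 48-of-200 count are assumptions made for the example; only the 50/50 count comes from the abstract.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, trials: int, confidence: float = 0.95) -> tuple[float, float]:
    """Return the (lower, upper) Wilson score interval for successes/trials."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / trials
    denom = 1 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # 50/50 is the count reported in the abstract for semantic operator naming.
    low, high = wilson_interval(50, 50)
    print(f"50/50  -> [{low:.1%}, {high:.1%}]")

    # ASSUMPTION: 48 of 200 is a hypothetical count that gives 24.0%;
    # the abstract does not state the HarmBench sample size.
    low, high = wilson_interval(48, 200)
    print(f"48/200 -> [{low:.1%}, {high:.1%}]")
```

Under these assumptions, a perfect 50/50 result still leaves a lower bound of about 93%, and the hypothetical 48/200 count gives a range close to the bracket quoted in the abstract. That agreement is suggestive only and does not establish the paper's actual sample size or method.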
