ArXiv TLDR

Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety

arXiv:2604.18487

Marcello Galisai, Susanna Cifani, Francesco Giarrusso, Piercosma Bisconti, Matteo Prandi + 3 more

cs.CL cs.AI

TLDR

New benchmark reveals frontier models' safety refusals are easily bypassed by stylistic prompt transformations, showing weak generalization in current safety techniques.

Key contributions

  • Introduces Adversarial Humanities Benchmark (AHB) to test model safety against stylistic prompt changes.
  • AHB rewrites harmful prompts in humanities styles (e.g., poetry, tales) while preserving intent, revealing weak safety generalization.
  • Transformed prompts raised the overall attack success rate (ASR) from 3.84% to 55.75% across 31 frontier models.
  • Identifies Chemical, Biological, Radiological, and Nuclear (CBRN) as the highest systemic risk.
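The headline numbers come down to a simple attack-success-rate calculation: the fraction of prompts for which a model complied rather than refused. A minimal sketch of that metric (all names and data here are illustrative, not from the AHB paper):

```python
# Illustrative ASR computation: ASR = successful attacks / total attempts.
# The function name and the sample data are hypothetical, chosen only to
# mirror the paper's reported ~3.84% (original) vs ~55.75% (transformed) gap.

def attack_success_rate(outcomes):
    """outcomes: list of booleans, True if the model complied with a harmful prompt."""
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)

# Same objectives, two framings: original phrasing vs. humanities-style rewrite.
original = [True] * 4 + [False] * 96        # 4 compliances out of 100
transformed = [True] * 56 + [False] * 44    # 56 compliances out of 100

print(f"original ASR:    {attack_success_rate(original):.2%}")     # 4.00%
print(f"transformed ASR: {attack_success_rate(transformed):.2%}")  # 56.00%
```

The interesting quantity in the benchmark is the gap between the two numbers for the *same* underlying objectives, since that isolates the effect of style from the effect of content.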

Why it matters

This paper highlights a significant vulnerability in frontier model safety: current techniques lack stylistic robustness. It demonstrates that simple transformations can bypass safety refusals, indicating a fundamental gap in models' understanding of harmful intent. This calls for more robust and generalizable safety measures.

Original Abstract

The Adversarial Humanities Benchmark (AHB) evaluates whether model safety refusals survive a shift away from familiar harmful prompt forms. Starting from harmful tasks drawn from MLCommons AILuminate, the benchmark rewrites the same objectives through humanities-style transformations while preserving intent. This extends literature on Adversarial Poetry and Adversarial Tales from single jailbreak operators to a broader benchmark family of stylistic obfuscation and goal concealment. In the benchmark results reported here, the original attacks record 3.84% attack success rate (ASR), while transformed methods range from 36.8% to 65.0%, yielding 55.75% overall ASR across 31 frontier models. Under a European Union AI Act Code-of-Practice-inspired systemic-risk lens, Chemical, biological, radiological and nuclear (CBRN) is the highest bucket. Taken together, this lack of stylistic robustness suggests that current safety techniques suffer from weak generalization: deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.
