Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety
Marcello Galisai, Susanna Cifani, Francesco Giarrusso, Piercosma Bisconti, Matteo Prandi, and 3 more
TLDR
A new benchmark reveals that frontier models' safety refusals are easily bypassed by stylistic prompt transformations, exposing weak generalization in current safety techniques.
Key contributions
- Introduces Adversarial Humanities Benchmark (AHB) to test model safety against stylistic prompt changes.
- AHB transforms harmful prompts into humanities styles, revealing weak safety generalization.
- Transformed prompts increased the attack success rate (ASR) from 3.84% to 55.75% across 31 frontier models.
- Identifies Chemical, Biological, Radiological, and Nuclear (CBRN) as the highest systemic-risk category.
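The headline numbers above are attack success rates: the fraction of harmful prompts for which a model complies rather than refuses, compared between original and stylistically transformed prompt forms. A minimal sketch of that comparison is below; the data structure, field names, and toy figures are illustrative assumptions, not the paper's actual evaluation harness.

```python
# Hypothetical sketch of an attack-success-rate (ASR) comparison.
# All names and toy data are illustrative, not from the AHB paper.
from dataclasses import dataclass

@dataclass
class Attempt:
    model: str
    style: str        # "original" or a humanities-style transform, e.g. "poetry"
    complied: bool    # True if the model produced the harmful content

def asr(attempts, style):
    """ASR for one style = harmful completions / total attempts in that style."""
    subset = [a for a in attempts if a.style == style]
    return sum(a.complied for a in subset) / len(subset)

# Toy data: original prompts are mostly refused; transformed ones often are not.
attempts = (
    [Attempt("model-a", "original", False)] * 24
    + [Attempt("model-a", "original", True)] * 1
    + [Attempt("model-a", "poetry", True)] * 14
    + [Attempt("model-a", "poetry", False)] * 11
)
print(f"original ASR: {asr(attempts, 'original'):.2%}")  # 4.00%
print(f"poetry ASR:   {asr(attempts, 'poetry'):.2%}")    # 56.00%
```

In the paper's setting, the same computation is aggregated over 31 models and multiple transformation styles, which is how the 3.84% baseline and 55.75% overall transformed ASR are obtained.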
Why it matters
This paper highlights a significant vulnerability in frontier model safety: current techniques lack stylistic robustness. Simple transformations can bypass safety refusals, indicating a fundamental gap in models' understanding of harmful intent and calling for more robust, generalizable safety measures.
Original Abstract
The Adversarial Humanities Benchmark (AHB) evaluates whether model safety refusals survive a shift away from familiar harmful prompt forms. Starting from harmful tasks drawn from MLCommons AILuminate, the benchmark rewrites the same objectives through humanities-style transformations while preserving intent. This extends literature on Adversarial Poetry and Adversarial Tales from single jailbreak operators to a broader benchmark family of stylistic obfuscation and goal concealment. In the benchmark results reported here, the original attacks record 3.84% attack success rate (ASR), while transformed methods range from 36.8% to 65.0%, yielding 55.75% overall ASR across 31 frontier models. Under a European Union AI Act Code-of-Practice-inspired systemic-risk lens, Chemical, biological, radiological and nuclear (CBRN) is the highest bucket. Taken together, this lack of stylistic robustness suggests that current safety techniques suffer from weak generalization: deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.