ArXiv TLDR

Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety

arXiv:2604.18487

Marcello Galisai, Susanna Cifani, Francesco Giarrusso, Piercosma Bisconti, Matteo Prandi + 3 more

cs.CL cs.AI

TLDR

New benchmark reveals frontier models' safety refusals are easily bypassed by stylistic prompt transformations, showing weak generalization in current safety techniques.

Key contributions

  • Introduces Adversarial Humanities Benchmark (AHB) to test model safety against stylistic prompt changes.
  • AHB rewrites harmful prompts in humanities styles (e.g., poetry, tales) while preserving intent, revealing weak safety generalization.
  • Transformed prompts raised the overall attack success rate (ASR) from 3.84% to 55.75% across 31 frontier models.
  • Identifies Chemical, Biological, Radiological, and Nuclear (CBRN) as the highest systemic risk.
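The headline numbers come down to a simple attack-success-rate calculation: the fraction of prompts for which a model complied rather than refused. A minimal sketch of that metric (all names and data here are illustrative, not from the AHB paper):

```python
# Illustrative ASR computation: ASR = successful attacks / total attempts.
# The function name and the sample data are hypothetical, chosen only to
# mirror the paper's reported ~3.84% (original) vs ~55.75% (transformed) gap.

def attack_success_rate(outcomes):
    """outcomes: list of booleans, True if the model complied with a harmful prompt."""
    if not outcomes:
        return 0.0
    return sum(outcomes) / len(outcomes)

# Same objectives, two framings: original phrasing vs. humanities-style rewrite.
original = [True] * 4 + [False] * 96        # 4 compliances out of 100
transformed = [True] * 56 + [False] * 44    # 56 compliances out of 100

print(f"original ASR:    {attack_success_rate(original):.2%}")     # 4.00%
print(f"transformed ASR: {attack_success_rate(transformed):.2%}")  # 56.00%
```

The interesting quantity in the benchmark is the gap between the two numbers for the *same* underlying objectives, since that isolates the effect of style from the effect of content.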

Why it matters

This paper highlights a significant vulnerability in frontier model safety: current techniques lack stylistic robustness. It demonstrates that simple transformations can bypass safety refusals, indicating a fundamental gap in models' understanding of harmful intent. This calls for more robust and generalizable safety measures.

Original Abstract

The Adversarial Humanities Benchmark (AHB) evaluates whether model safety refusals survive a shift away from familiar harmful prompt forms. Starting from harmful tasks drawn from MLCommons AILuminate, the benchmark rewrites the same objectives through humanities-style transformations while preserving intent. This extends literature on Adversarial Poetry and Adversarial Tales from single jailbreak operators to a broader benchmark family of stylistic obfuscation and goal concealment. In the benchmark results reported here, the original attacks record 3.84% attack success rate (ASR), while transformed methods range from 36.8% to 65.0%, yielding 55.75% overall ASR across 31 frontier models. Under a European Union AI Act Code-of-Practice-inspired systemic-risk lens, Chemical, biological, radiological and nuclear (CBRN) is the highest bucket. Taken together, this lack of stylistic robustness suggests that current safety techniques suffer from weak generalization: deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.
