ArXiv TLDR

Exposing LLM Safety Gaps Through Mathematical Encoding: New Attacks and Systematic Analysis

arXiv: 2605.03441

Haoyu Zhang, Mohammad Zandsalimy, Shanu Sushmita

cs.CR, cs.AI, cs.CL, cs.LG

TLDR

New attacks encode harmful prompts as math problems, bypassing LLM safety filters with high success rates and revealing fundamental security gaps.

Key contributions

  • Harmful prompts encoded as coherent mathematical problems bypass LLM safety filters (46–56% average attack success across eight target models).
  • Effectiveness depends on deep reformulation of the harmful content by a helper LLM, not on mathematical notation itself; rule-based formatting alone performs no better than unencoded baselines.
  • Introduces a novel Formal Logic encoding, demonstrating generalizability across mathematical formalisms.
  • Newer models (GPT-5, GPT-5-Mini) show greater robustness but remain vulnerable to these attacks.

Why it matters

This paper exposes a critical vulnerability in LLM safety, showing that harmful content can bypass filters when deeply reformulated as coherent mathematical problems. It highlights that current defenses are insufficient against sophisticated, non-obvious attacks, motivating robust safety frameworks that reason about underlying mathematical structure rather than surface-level semantics.

Original Abstract

Large language models (LLMs) employ safety mechanisms to prevent harmful outputs, yet these defenses primarily rely on semantic pattern matching. We show that encoding harmful prompts as coherent mathematical problems – using formalisms such as set theory, formal logic, and quantum mechanics – bypasses these filters at high rates, achieving 46%–56% average attack success across eight target models and two established benchmarks. Crucially, the effectiveness depends not on mathematical notation itself, but on whether a helper LLM deeply reformulates the harmful content into a genuine mathematical problem: rule-based encodings that apply mathematical formatting without such reformulation perform no better than unencoded baselines. We introduce a novel Formal Logic encoding that achieves attack success comparable to Set Theory, demonstrating that this vulnerability generalizes across mathematical formalisms. Additional experiments with repeat post-processing confirm that these attacks are robust to simple prompt augmentation. Notably, newer models (GPT-5, GPT-5-Mini) show substantially greater robustness than older models, though they remain vulnerable. Our findings highlight fundamental gaps in current safety frameworks and motivate defenses that reason about mathematical structure rather than surface-level semantics.
