Do LLMs Follow Their Own Rules? A Reflexive Audit of Self-Stated Safety Policies

April 10, 20262604.09189

cs.CLcs.AIcs.LG

TLDR

LLMs often fail to follow their own stated safety policies, revealing a measurable gap between what they say and what they do.

Key contributions

Introduced SNCA, a framework to audit LLMs' self-stated safety policies against their actual behavior.
SNCA extracts, formalizes (Absolute, Conditional, Adaptive), and measures policy compliance.
Found systematic gaps: models claiming absolute refusal often comply with harmful prompts.
Reasoning models are consistent but fail to articulate policies for 29% of categories.

Why it matters

This paper introduces a novel method to audit LLMs' internal safety consistency, revealing critical gaps between their stated rules and actual behavior. This is vital for developing more reliable and aligned LLMs, complementing existing external safety benchmarks.

Original Abstract

LLMs internalize safety policies through RLHF, yet these policies are never formally specified and remain difficult to inspect. Existing benchmarks evaluate models against external standards but do not measure whether models understand and enforce their own stated boundaries. We introduce the Symbolic-Neural Consistency Audit (SNCA), a framework that (1) extracts a model's self-stated safety rules via structured prompts, (2) formalizes them as typed predicates (Absolute, Conditional, Adaptive), and (3) measures behavioral compliance via deterministic comparison against harm benchmarks. Evaluating four frontier models across 45 harm categories and 47,496 observations reveals systematic gaps between stated policy and observed behavior: models claiming absolute refusal frequently comply with harmful prompts, reasoning models achieve the highest self-consistency but fail to articulate policies for 29% of categories, and cross-model agreement on rule types is remarkably low (11%). These results demonstrate that the gap between what LLMs say and what they do is measurable and architecture-dependent, motivating reflexive consistency audits as a complement to behavioral benchmarks.

View on arXiv Download PDF

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.

TLDR

Key contributions

Why it matters

Original Abstract

📬 Weekly AI Paper Digest

Related papers