Acceptance Cards: A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims
Phongsakon Mark Konrad, Toygar Tanyel, Serkan Ayvaz
TLDR
Acceptance Cards introduce a four-diagnostic standard for rigorously evaluating safe fine-tuning defense claims; under this audit, SafeLoRA fails the full-card pass on Gemma-2-2B-it.
Key contributions
- Introduces "Acceptance Cards," a four-diagnostic standard for evaluating safe fine-tuning defenses.
- Protocol checks statistical reliability, semantic generalization, mechanism alignment, and cross-task transfer.
- Demonstrates that SafeLoRA fails the full-card pass on Gemma-2-2B-it under this new audit.
Why it matters
Current evaluations of safe fine-tuning defenses often rest on a single held-out gap reduction, which can reflect sampling noise or artifacts rather than a genuine defense. Acceptance Cards provide a rigorous, multi-diagnostic standard so that defense claims are statistically reliable and genuinely effective, preventing false confidence in AI safety mechanisms.
Original Abstract
Safe fine-tuning defenses are often endorsed on the basis of a held-out gap reduction, but the same reduction can come from sampling noise, subject artifacts, capability loss, or a mechanism that does not transfer. We introduce Acceptance Cards: an evaluation protocol, a documentation object, an executable audit package, and a claim-specific evidential standard for safe fine-tuning defense claims. The protocol checks statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer before treating a gap reduction as a full-card pass. Re-scored under this installed-gap protocol, SafeLoRA fails the full-card pass on Gemma-2-2B-it: under strict mechanism-class coding it fails all four diagnostics, and under a permissive shrinkage relabel it still fails three of four. This is a narrow installed-gap audit on one model family, not a global judgment of SafeLoRA's effectiveness. In a 46-cell audit, no cell satisfies the strict conjunction. The closest family is a near miss that passes reliability and mechanism checks where the required data are available, but fails the fresh-subject threshold, lacks a strict transfer pass, and carries a measurable deployment-accuracy cost.
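The "full-card pass" described in the abstract is a strict conjunction of four diagnostics. As a minimal sketch (the field names and `Diagnostics` container are illustrative, not from the paper's audit package), the acceptance logic looks like:

```python
from dataclasses import dataclass

@dataclass
class Diagnostics:
    # Hypothetical booleans for the four checks named in the abstract
    statistical_reliability: bool
    fresh_semantic_generalization: bool
    mechanism_alignment: bool
    cross_task_transfer: bool

def full_card_pass(d: Diagnostics) -> bool:
    # A gap reduction counts only if ALL four diagnostics hold
    # (the strict conjunction the audit requires)
    return all([
        d.statistical_reliability,
        d.fresh_semantic_generalization,
        d.mechanism_alignment,
        d.cross_task_transfer,
    ])

# The "near miss" pattern from the abstract: reliability and mechanism
# checks pass, but fresh-subject generalization and transfer do not
near_miss = Diagnostics(True, False, True, False)
print(full_card_pass(near_miss))  # → False
```

This makes the paper's headline finding concrete: a cell can pass individual diagnostics yet still fail the conjunction, which is why none of the 46 audited cells satisfies the full-card standard.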