Acceptance Cards: A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims
Phongsakon Mark Konrad, Toygar Tanyel, Serkan Ayvaz
TLDR
Acceptance Cards introduce a four-diagnostic standard for rigorously evaluating safe fine-tuning defense claims; under this audit, SafeLoRA fails the full-card pass on Gemma-2-2B-it.
Key contributions
- Introduces "Acceptance Cards," a four-diagnostic standard for evaluating safe fine-tuning defenses.
- Protocol checks statistical reliability, semantic generalization, mechanism alignment, and cross-task transfer.
- Demonstrates that SafeLoRA fails the full-card pass on Gemma-2-2B-it under this new audit.
Why it matters
Current evaluations of safe fine-tuning defenses often rest on a single held-out gap reduction, which can reflect sampling noise or artifacts rather than a genuine defense. Acceptance Cards provide a rigorous, multi-diagnostic standard so that defense claims are statistically reliable and genuinely effective, preventing false confidence in AI safety mechanisms.
Original Abstract
Safe fine-tuning defenses are often endorsed on the basis of a held-out gap reduction, but the same reduction can come from sampling noise, subject artifacts, capability loss, or a mechanism that does not transfer. We introduce Acceptance Cards: an evaluation protocol, a documentation object, an executable audit package, and a claim-specific evidential standard for safe fine-tuning defense claims. The protocol checks statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer before treating a gap reduction as a full-card pass. Re-scored under this installed-gap protocol, SafeLoRA fails the full-card pass on Gemma-2-2B-it: under strict mechanism-class coding it fails all four diagnostics, and under a permissive shrinkage relabel it still fails three of four. This is a narrow installed-gap audit on one model family, not a global judgment of SafeLoRA's effectiveness. In a 46-cell audit, no cell satisfies the strict conjunction. The closest family is a near miss that passes reliability and mechanism checks where the required data are available, but fails the fresh-subject threshold, lacks a strict transfer pass, and carries a measurable deployment-accuracy cost.
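The "full-card pass" described in the abstract is a strict conjunction of four diagnostics. As a minimal sketch (the field names and `Diagnostics` container are illustrative, not from the paper's audit package), the acceptance logic looks like:

```python
from dataclasses import dataclass

@dataclass
class Diagnostics:
    # Hypothetical booleans for the four checks named in the abstract
    statistical_reliability: bool
    fresh_semantic_generalization: bool
    mechanism_alignment: bool
    cross_task_transfer: bool

def full_card_pass(d: Diagnostics) -> bool:
    # A gap reduction counts only if ALL four diagnostics hold
    # (the strict conjunction the audit requires)
    return all([
        d.statistical_reliability,
        d.fresh_semantic_generalization,
        d.mechanism_alignment,
        d.cross_task_transfer,
    ])

# The "near miss" pattern from the abstract: reliability and mechanism
# checks pass, but fresh-subject generalization and transfer do not
near_miss = Diagnostics(True, False, True, False)
print(full_card_pass(near_miss))  # → False
```

This makes the paper's headline finding concrete: a cell can pass individual diagnostics yet still fail the conjunction, which is why none of the 46 audited cells satisfies the full-card standard.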