Gaming the Metric, Not the Harm: Certifying Safety Audits against Strategic Platform Manipulation

May 7, 2026 · arXiv:2605.06324

Florian A. D. Burnat, Brittany I. Davidson

cs.CR, cs.CY, cs.LG

TLDR

This paper shows how online safety metrics can be gamed by platforms using content variants and proposes a robust "semantic-envelope" metric to certify true harm reduction.

Key contributions

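Demonstrates that any online safety metric that scores variants directly is manipulable as soon as two equivalent variants in a harmful class receive different scores.
Proposes the "semantic-envelope lift", which assigns each variant the maximum score in its class, and shows it is the unique pointwise minimum among conservative classwise-constant repairs.
Introduces a "class-stratified certificate" that bounds true harm under every platform strategy, with an error term absorbing annotation and protocol error.
Checks the metric and certificate at three levels: exhaustive enumeration over a finite grid of mixed strategies, an SMT encoding in Z3 cross-replayed in cvc5, and a bounded single-player MDP in PRISM-games.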
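
As a rough illustration of the envelope idea, the sketch below treats the transformation graph as an undirected edge list, takes its connected components as semantic classes, and lifts every variant to the maximum score in its class. The function names, the variant labels, and the scores are hypothetical and chosen only for this example; this is not the authors' implementation.

```python
from collections import defaultdict


def semantic_classes(variants, edges):
    """Map each variant to a representative of its connected component.

    Assumption: edges of the transformation graph are treated as undirected.
    """
    parent = {v: v for v in variants}

    def find(v):
        # Follow parent links to the root, shortening the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
    return {v: find(v) for v in variants}


def envelope_lift(scores, edges):
    """Assign each variant the maximum score found in its semantic class."""
    labels = semantic_classes(scores.keys(), edges)
    class_max = defaultdict(lambda: float("-inf"))
    for variant, score in scores.items():
        class_max[labels[variant]] = max(class_max[labels[variant]], score)
    return {variant: class_max[labels[variant]] for variant in scores}


# Hypothetical catalog: a1, a2, a3 are equivalent variants; b1 stands alone.
scores = {"a1": 0.9, "a2": 0.2, "a3": 0.5, "b1": 0.1}
edges = [("a1", "a2"), ("a2", "a3")]
lifted = envelope_lift(scores, edges)
print(lifted)
```

Because the lifted score is constant within a class, swapping one equivalent variant for another cannot change it, which is the property the direct per-variant score lacks.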

Why it matters

Online safety regulations increasingly rely on scalar metrics, which platforms can strategically manipulate without reducing actual harm. This paper introduces a robust metric and certification method to ensure genuine harm reduction, preventing platforms from gaming the system and improving the effectiveness of online safety audits.

Original Abstract

Online-safety regulation under the UK Online Safety Act and the EU Digital Services Act increasingly treats scalar metrics as compliance evidence. Once announced, such a metric also becomes an optimization target: a strategic platform can improve its score by routing recommendations through semantically equivalent content variants, without reducing true harm. We ask when such an audit metric can still certify a genuine reduction in harm. The protocol is modeled as a published transformation graph whose connected components form semantic classes, and the metric itself is treated as a security object. Three results follow. First, any metric that scores variants directly is manipulable as soon as two equivalent variants in a harmful class disagree in score. Second, the semantic-envelope lift, which assigns each variant the maximum score in its class, is the unique pointwise minimum among conservative classwise-constant repairs. Third, a class-stratified certificate, $H^\star(x) \le (1/\hat{\alpha}) M_{\mathrm{Env}(m)}(x) + \bar{\eta}$, holds for every platform strategy, with $\bar{\eta}$ absorbing annotation and protocol error. We check the claims at three levels: exhaustive enumeration on a finite-state grid of mixed strategies, an SMT encoding in Z3 cross-replayed in cvc5, and a bounded single-player MDP encoded in PRISM-games. The fragile metric fails manipulation invariance and cannot support the same useful predeclared class-coverage certificate; under the envelope-level certificate, it produces large violations at every tested instance, with a large mean gaming gap across random catalogs at a fixed audit budget. The semantic-envelope metric exhibits no such violation in the tested instances.
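
The manipulation described in the abstract can be made concrete with a toy calculation. The sketch below assumes a platform strategy is a probability distribution over variants and that the audit metric is the expected per-variant score under that distribution; both are modeling assumptions for illustration, and the catalog, the scores, and the "gap" computed here are hypothetical rather than the paper's definitions.

```python
import math


def expected_score(strategy, scores):
    """Expected per-variant score under a strategy (assumed metric form)."""
    return sum(prob * scores[variant] for variant, prob in strategy.items())


# Hypothetical catalog: variant -> semantic class, and a fragile per-variant score.
classes = {"a1": "A", "a2": "A", "b1": "B"}
fragile = {"a1": 0.9, "a2": 0.2, "b1": 0.1}

# Envelope score: every variant takes the maximum fragile score in its class.
class_max = {}
for variant, cls in classes.items():
    class_max[cls] = max(class_max.get(cls, float("-inf")), fragile[variant])
envelope = {variant: class_max[cls] for variant, cls in classes.items()}

# Two strategies with identical exposure per class (0.5 on A, 0.5 on B).
# The second only reroutes class-A exposure to its lowest-scoring variant.
honest = {"a1": 0.5, "a2": 0.0, "b1": 0.5}
gamed = {"a1": 0.0, "a2": 0.5, "b1": 0.5}

fragile_gap = expected_score(honest, fragile) - expected_score(gamed, fragile)
envelope_gap = expected_score(honest, envelope) - expected_score(gamed, envelope)

print(f"fragile metric drop:  {fragile_gap:.2f}")
print(f"envelope metric drop: {envelope_gap:.2f}")
```

In this toy catalog the fragile metric improves although class-level exposure is unchanged, while the envelope metric does not move, matching the first two results stated in the abstract.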
