Gaming the Metric, Not the Harm: Certifying Safety Audits against Strategic Platform Manipulation
Florian A. D. Burnat, Brittany I. Davidson
TLDR
This paper shows how online safety metrics can be gamed by platforms using content variants and proposes a robust "semantic-envelope" metric to certify true harm reduction.
Key contributions
- Demonstrates that online safety metrics are manipulable if equivalent harmful content variants score differently.
- Proposes the "semantic-envelope lift", which assigns each variant the maximum score in its semantic class, and shows it is the unique pointwise minimum among conservative classwise-constant repairs.
- Introduces a "class-stratified certificate" that bounds true harm by the envelope metric and holds for every platform strategy.
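- Checks the proposed metric and certificate at three levels: exhaustive enumeration over a finite-state grid of mixed strategies, an SMT encoding in Z3 cross-replayed in cvc5, and a bounded single-player MDP in PRISM-games.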
Why it matters
Online safety regulations increasingly rely on scalar metrics, which platforms can strategically manipulate without reducing actual harm. This paper introduces a robust metric and certification method to ensure genuine harm reduction, preventing platforms from gaming the system and improving the effectiveness of online safety audits.
Original Abstract
Online-safety regulation under the UK Online Safety Act and the EU Digital Services Act increasingly treats scalar metrics as compliance evidence. Once announced, such a metric also becomes an optimization target: a strategic platform can improve its score by routing recommendations through semantically equivalent content variants, without reducing true harm. We ask when such an audit metric can still certify a genuine reduction in harm. The protocol is modeled as a published transformation graph whose connected components form semantic classes, and the metric itself is treated as a security object. Three results follow. First, any metric that scores variants directly is manipulable as soon as two equivalent variants in a harmful class disagree in score. Second, the semantic-envelope lift, which assigns each variant the maximum score in its class, is the unique pointwise minimum among conservative classwise-constant repairs. Third, a class-stratified certificate, $H^\star(x) \le (1/\hat{\alpha}) M_{\mathrm{Env}(m)}(x) + \bar{\eta}$, holds for every platform strategy, with $\bar{\eta}$ absorbing annotation and protocol error. We check the claims at three levels: exhaustive enumeration on a finite-state grid of mixed strategies, an SMT encoding in Z3 cross-replayed in cvc5, and a bounded single-player MDP encoded in PRISM-games. The fragile metric fails manipulation invariance and cannot support the same useful predeclared class-coverage certificate; under the envelope-level certificate, it produces large violations at every tested instance, with a large mean gaming gap across random catalogs at a fixed audit budget. The semantic-envelope metric exhibits no such violation in the tested instances.
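The envelope construction described in the abstract can be illustrated with a small sketch. This is not the authors' code: the function names (`semantic_classes`, `envelope_lift`, `within_class_spread`), the toy catalog, and the scores are all hypothetical, and the spread computed here is only a simple stand-in for the idea of a gaming gap, not the paper's definition. The sketch assumes the transformation graph is undirected, takes its connected components as semantic classes, and assigns every variant the maximum score in its class.

```python
from collections import defaultdict


def semantic_classes(variants, edges):
    """Connected components of an (assumed undirected) transformation graph."""
    parent = {v: v for v in variants}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    classes = defaultdict(set)
    for v in variants:
        classes[find(v)].add(v)
    return list(classes.values())


def envelope_lift(scores, classes):
    """Give every variant the maximum score found in its semantic class."""
    lifted = {}
    for cls in classes:
        top = max(scores[v] for v in cls)
        for v in cls:
            lifted[v] = top
    return lifted


def within_class_spread(scores, classes):
    """Largest score disagreement between equivalent variants (illustrative only)."""
    return max(
        max(scores[v] for v in cls) - min(scores[v] for v in cls)
        for cls in classes
    )


# Hypothetical catalog: a1, a2, a3 are equivalent variants; b1 stands alone.
variants = ["a1", "a2", "a3", "b1"]
edges = [("a1", "a2"), ("a2", "a3")]
scores = {"a1": 0.9, "a2": 0.2, "a3": 0.5, "b1": 0.1}

classes = semantic_classes(variants, edges)
lifted = envelope_lift(scores, classes)

print(within_class_spread(scores, classes))  # positive: variants disagree
print(within_class_spread(lifted, classes))  # zero: lifted scores are classwise constant
```

In this toy catalog, a platform scored by the raw metric could swap `a1` for `a2` and lower its score without changing the semantic class it serves. After the lift, all three variants carry the same score, so that swap no longer changes the metric.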