Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs
TL;DR
A commit-open protocol using SAE feature traces detects dishonest LLM model substitutions, closing parallel-serve side-channels with low overhead.
Key contributions
- Introduces a commit-open protocol using Merkle trees and per-position SAE feature-trace sketches.
- Closes the parallel-serve side-channel that allows dishonest providers to evade existing verification.
- Rejects all 17 tested attackers (same-family lifts, cross-family substitutes, and adaptive LoRA) across three LLM backbones at a single shared threshold.
- Adds at most 2.1% overhead to forward-only wall-clock time at batch size 32.
Why it matters
Hosted LLM providers have an incentive to silently substitute cheaper models, eroding user trust. This paper offers a way to verify model authenticity: by detecting dishonest substitutions, it helps ensure users receive the advertised model, fostering transparency and accountability in LLM services.
Original Abstract
Hosted-LLM providers have a silent-substitution incentive: advertise a stronger model while serving cheaper replies. Probe-after-return schemes such as SVIP leave a parallel-serve side-channel, since a dishonest provider can route the verifier's probe to the advertised model while serving ordinary users from a substitute. We propose a commit-open protocol that closes this gap. Before any opening request, the provider commits via a Merkle tree to a per-position sparse-autoencoder (SAE) feature-trace sketch of its served output at a published probe layer. A verifier opens random positions, scores them against a public named-circuit probe library calibrated with cross-backend noise, and decides with a fixed-threshold joint-consistency z-score rule. We instantiate the protocol on three backbones -- Qwen3-1.7B, Gemma-2-2B, and a 4.5x scale-up to Gemma-2-9B with a 131k-feature SAE. Of 17 attackers spanning same-family lifts, cross-family substitutes, and rank-<=128 adaptive LoRA, all are rejected at a shared, scale-stable threshold; the same attackers all evade a matched SVIP-style parallel-serve baseline. A white-box end-to-end attack that backpropagates through the frozen SAE encoder does not close the margin, and a feature-forgery attacker that never runs M_hon is bounded in closed form by an intrinsic-dimension argument. Commitment adds <=2.1% to forward-only wall-clock at batch 32.
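The commit-open flow described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the real protocol commits per-position SAE feature-trace sketches at a published probe layer and scores openings against a named-circuit probe library with cross-backend noise calibration. Here the per-position sketches are placeholder byte strings, and `accept` is a simplified joint-consistency z-score rule with illustrative `mu`, `sigma`, and `tau` parameters.

```python
import hashlib
import random

def H(b: bytes) -> bytes:
    """SHA-256 hash used for both leaves and internal Merkle nodes."""
    return hashlib.sha256(b).digest()

def build_tree(leaves):
    """Provider side: commit to per-position sketches.
    Returns all tree levels; level 0 = leaf hashes, last level = [root]."""
    level = [H(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:          # pad odd levels by duplicating the last node
            level = level + [level[-1]]
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def prove(levels, idx):
    """Provider side: authentication path opening the sketch at position idx."""
    path = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        path.append(level[idx ^ 1])  # sibling at this level
        idx //= 2
    return path

def verify(root, leaf, idx, path):
    """Verifier side: recompute the root from an opened leaf and its path."""
    node = H(leaf)
    for sib in path:
        node = H(node + sib) if idx % 2 == 0 else H(sib + node)
        idx //= 2
    return node == root

def accept(scores, mu, sigma, tau=3.0):
    """Simplified joint-consistency rule: accept if the mean per-position
    score, standardized against calibrated noise (mu, sigma), is within tau."""
    mean = sum(scores) / len(scores)
    z = (mean - mu) / (sigma / len(scores) ** 0.5)
    return abs(z) <= tau

# Demo flow: provider commits before any opening request, then the
# verifier samples random positions and checks each opening.
sketches = [f"sae-sketch-pos-{i}".encode() for i in range(16)]  # placeholders
levels = build_tree(sketches)
root = levels[-1][0]                 # published commitment
opened = random.sample(range(16), 4)  # verifier's random challenge positions
assert all(verify(root, sketches[i], i, prove(levels, i)) for i in opened)
```

A substitute model would have to either produce sketches matching the honest model's feature traces at the challenged positions (caught by the score rule) or equivocate on the commitment (caught by the Merkle check), which is what closes the parallel-serve side-channel left by probe-after-return schemes.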