On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference
Zhengyi Li, Yakai Wang, Kang Yang, Yu Yu, Jiaping Gui + 4 more
TLDR
This paper demonstrates a novel attack that bypasses the shuffling defense in Transformer secure inference, enabling model weight extraction.
Key contributions
- Identifies a critical vulnerability in the shuffling defense for Transformer secure inference.
- Proposes an attack that aligns differently shuffled activations to a common permutation (see the sketch after this list).
- Successfully extracts Transformer model weights (Pythia-70m, GPT-2) with high accuracy.
- Demonstrates the attack with a low query cost, making it practical for adversaries.
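The alignment step can be pictured as a matching problem: two queries return the same underlying activations under different shuffles, and the adversary must find the permutation that maps one copy onto the other. The toy sketch below illustrates this idea with a linear assignment over pairwise distances; the matrix sizes, noise level, and use of the Hungarian algorithm are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch: aligning two row-shuffled copies of the same
# activation matrix via linear assignment on pairwise distances.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)

# Ground-truth activations: n tokens, d hidden dimensions (toy sizes).
n, d = 32, 16
acts = rng.normal(size=(n, d))

# Two queries reveal the same activations under different row shuffles;
# a small perturbation mimics near-identical but not exact copies.
perm_a, perm_b = rng.permutation(n), rng.permutation(n)
shuffled_a = acts[perm_a]
shuffled_b = acts[perm_b] + rng.normal(scale=1e-4, size=(n, d))

# Cost matrix of pairwise squared distances between rows of the two copies.
cost = ((shuffled_a[:, None, :] - shuffled_b[None, :, :]) ** 2).sum(-1)

# The Hungarian algorithm finds the row matching that aligns copy B onto A.
_, col_idx = linear_sum_assignment(cost)
aligned_b = shuffled_b[col_idx]

print("alignment MSE:", float(((shuffled_a - aligned_b) ** 2).mean()))
```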
Why it matters
Secure inference for Transformers is crucial, but the shuffling defense that prior works rely on to make efficiency-oriented activation disclosure safe turns out to be vulnerable. This work exposes a significant security flaw, demonstrating that model weights can still be extracted, and calls for stronger defenses to protect sensitive AI models.
Original Abstract
For Transformer models, cryptographically secure inference ensures that the client learns only the final output, while the server learns nothing about the client's input. However, securely computing nonlinear layers remains a major efficiency bottleneck due to the substantial communication rounds and data transmission required. To address this issue, prior works reveal intermediate activations to the client, allowing nonlinear operations to be computed in plaintext. Although this approach significantly improves efficiency, exposing activations enables adversaries to extract model weights. To mitigate this risk, existing works employ a shuffling defense that reveals only randomly permuted activations to the client. In this work, we show that the shuffling defense is not as robust as previously claimed. We propose an attack that aligns differently shuffled activations to a common permutation and subsequently exploits them to extract model weights. Experiments on Pythia-70m and GPT-2 demonstrate that the proposed attack can align shuffled activations with mean squared errors ranging from $10^{-9}$ to $10^{-6}$. With a query cost of approximately \$1, the adversary can recover model weights with L1-norm differences ranging from $10^{-4}$ to $10^{-2}$ compared to the oracle weights.
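Once activations on either side of a linear layer are aligned to a common permutation and visible in plaintext, weight recovery reduces to a regression problem. The sketch below shows the general least-squares idea under assumed toy shapes; it is not the paper's attack pipeline, and the joint bias estimation is an illustrative choice.

```python
# Hypothetical sketch: recovering a linear layer's weights from aligned
# plaintext input/output activations via least squares (toy dimensions).
import numpy as np

rng = np.random.default_rng(1)

d_in, d_out, n_queries = 64, 64, 256
W_true = rng.normal(size=(d_in, d_out))
b_true = rng.normal(size=(d_out,))

# Activations entering the layer (aligned to a common permutation) and the
# corresponding outputs observed by the adversary.
X = rng.normal(size=(n_queries, d_in))
Y = X @ W_true + b_true

# Append a ones column so the bias is estimated jointly with the weights.
X_aug = np.hstack([X, np.ones((n_queries, 1))])
theta, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)
W_hat, b_hat = theta[:-1], theta[-1]

print("mean L1 error on weights:", float(np.abs(W_hat - W_true).mean()))
```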