ArXiv TLDR

Pop Quiz Attack: Black-box Membership Inference Attacks Against Large Language Models

2605.06423

Zeyuan Chen, Yihan Ma, Xinyue Shen, Michael Backes, Yang Zhang

cs.CR

TLDR

Introduces PopQuiz, a black-box membership inference attack that turns data into quizzes to reveal if LLMs memorized specific training examples.

Key contributions

  • Proposes PopQuiz, a black-box membership inference attack for LLMs.
  • Converts target data into multiple-choice questions to infer model membership.
  • Achieves an average ROC-AUC of 0.873 across 6 LLMs and 4 datasets, outperforming prior methods by 20.6%.
  • Analyzes attack factors and finds defenses reduce but don't eliminate risk.

Why it matters

This paper introduces a novel and effective black-box attack, PopQuiz, demonstrating significant privacy vulnerabilities in widely used LLMs like GPT-4o and LLaMA2. It highlights that current defenses are insufficient, urging further research into robust privacy-preserving LLM training.

Original Abstract

Large language models (LLMs) show strong performance across many applications, but their ability to memorize and potentially reveal training data raises serious privacy concerns. We introduce the PopQuiz Attack, a black-box membership inference attack that tests whether a model can recall specific training examples. The core idea is to turn target data into quiz-style multiple-choice questions and infer membership from the model's answers. Across six widely used LLMs (GPT-3.5, GPT-4o, LLaMA2-7b, LLaMA2-13b, Mistral-7b, and Vicuna-7b) and four datasets, our method achieves an average ROC-AUC of 0.873 and outperforms existing approaches by 20.6%. We further analyze factors affecting attack success, including query complexity, data type, data structure, and training settings. We also evaluate instruction-based, filter-based, and differential privacy-based defenses, which reduce performance but do not eliminate the risk. Our results highlight persistent privacy vulnerabilities in modern LLMs.
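The core quiz-based probing idea can be sketched in a few lines. This is an illustrative mock-up, not the paper's implementation: `make_quiz`, `membership_score`, and the `ask_model` callback are hypothetical names, and the paper's actual quiz construction and distractor generation are more involved.

```python
import random

def make_quiz(passage, distractors, n_options=4, seed=0):
    """Build one multiple-choice item: the target passage hidden among distractors."""
    rng = random.Random(seed)
    options = rng.sample(distractors, n_options - 1) + [passage]
    rng.shuffle(options)
    correct = options.index(passage)
    prompt = "Which option appeared in your training data?\n" + "\n".join(
        f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options)
    )
    return prompt, chr(65 + correct)  # prompt text and the correct letter

def membership_score(ask_model, passages, distractor_pool, n_trials=10):
    """Fraction of quizzes answered correctly.

    A score well above the 1/n_options random-guess baseline suggests the
    model has memorized the target passages (i.e., they were members of
    its training data).
    """
    correct = total = 0
    for passage in passages:
        for trial in range(n_trials):
            prompt, answer = make_quiz(passage, distractor_pool, seed=trial)
            if ask_model(prompt) == answer:
                correct += 1
            total += 1
    return correct / total
```

In a real attack, `ask_model` would wrap a black-box chat completion call against the target LLM, and the resulting score would be thresholded (or fed into ROC-AUC computation against known members and non-members) to decide membership.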
