Reconstruction of Personally Identifiable Information from Supervised Finetuned Models
TLDR
This paper shows that PII can be reconstructed from supervised finetuned LLMs and proposes COVA, a decoding algorithm that improves reconstruction under prefix-based attacks.
Key contributions
- First study on PII reconstruction from Supervised Finetuned (SFT) large language models.
- Created new multi-turn, PII-rich Q&A datasets in medical and legal domains for evaluation.
- Introduced COVA, a novel decoding algorithm, to reconstruct PII under prefix-based attacks.
- Demonstrated that partial attacker knowledge significantly improves PII reconstruction success.
Why it matters
This paper is crucial as it uncovers significant privacy vulnerabilities in Supervised Finetuned LLMs, demonstrating how personally identifiable information can be reconstructed. It highlights the urgent need for better privacy-preserving fine-tuning methods and provides a new tool, COVA, for assessing these risks.
Original Abstract
Supervised Finetuning (SFT) has become one of the primary methods for adapting a large language model (LLM) with extensive pre-trained knowledge to domain-specific, instruction-following tasks. SFT datasets, composed of instruction-response pairs, often include user-provided information that may contain sensitive data such as personally identifiable information (PII), raising privacy concerns. This paper studies the problem of PII reconstruction from SFT models for the first time. We construct multi-turn, user-centric Q&A datasets in sensitive domains, specifically medical and legal settings, that incorporate PII to enable realistic evaluation of leakage. Using these datasets, we evaluate the extent to which an adversary, with varying levels of knowledge about the fine-tuning dataset, can infer sensitive information about individuals whose data was used during SFT. In the reconstruction setting, we propose COVA, a novel decoding algorithm to reconstruct PII under prefix-based attacks, consistently outperforming existing extraction methods. Our results show that even partial attacker knowledge can significantly improve reconstruction success, while leakage varies substantially across PII types.
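The paper does not publish COVA's internals in this summary, but the prefix-based attack setting it targets can be illustrated with a minimal sketch: the adversary feeds a known prefix (e.g. a record template from the fine-tuning distribution) to the model and greedily decodes the continuation, hoping memorized PII is reproduced verbatim. The toy word-level bigram "model", the record format, and all names and numbers below are invented stand-ins for an SFT'd LLM and its dataset, not the paper's actual setup.

```python
# Illustrative sketch of a prefix-based extraction attack (NOT the paper's
# COVA algorithm). A toy bigram model stands in for a finetuned LLM that
# has memorized its SFT data; the record, names, and SSN are made up.
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Word-level bigram counts: a stand-in for a model memorizing SFT data."""
    counts = defaultdict(Counter)
    for doc in corpus:
        tokens = doc.split()
        for a, b in zip(tokens, tokens[1:]):
            counts[a][b] += 1
    return counts

def greedy_decode(model, prefix, max_new_tokens=10):
    """Greedily extend the attacker-known prefix with the most likely next token."""
    tokens = prefix.split()
    for _ in range(max_new_tokens):
        successors = model.get(tokens[-1])
        if not successors:
            break
        tokens.append(successors.most_common(1)[0][0])
    return " ".join(tokens)

# Hypothetical SFT record containing PII (fictitious name and SSN).
corpus = ["Patient: Jane Doe SSN: 123-45-6789 Diagnosis: flu"]
model = train_bigram(corpus)

# The attacker knows only the template prefix and recovers the PII verbatim.
print(greedy_decode(model, "Patient:", max_new_tokens=4))
# → Patient: Jane Doe SSN: 123-45-6789
```

In the paper's setting the same idea applies to a real SFT model: partial attacker knowledge (here, the record template) substantially narrows the search space, which is consistent with the reported finding that even partial knowledge improves reconstruction success.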