Vaporizer: Breaking Watermarking Schemes for Large Language Model Outputs
Jonathan Hong Jin Ng, Anh Tu Ngo, Anupam Chattopadhyay
TLDR
This paper introduces "Vaporizer," an attack framework demonstrating how to effectively remove watermarks from large language model outputs.
Key contributions
- Analyzes state-of-the-art LLM watermarking schemes for their claimed robustness.
- Introduces "Vaporizer," a suite of modified-text attacks spanning lexical alterations, machine translation, and neural paraphrasing.
- Evaluates attack efficacy on two criteria: successful watermark removal and semantic preservation (measured via BERTScore and Flesch Reading Ease); see the sketch after this list.
- Reveals that existing LLM watermarks can be removed with reasonable effort while preserving text meaning.
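The paper's evaluation code is not reproduced here, but a minimal sketch of the two headline semantic-preservation checks, using the open-source bert-score and textstat packages (a plausible tooling choice, not confirmed by the paper; the example sentences are illustrative), could look like this:

```python
# pip install bert-score textstat
from bert_score import score as bert_score
import textstat

# Illustrative original vs. attacked text pair (not from the paper).
original = ["The quick brown fox jumps over the lazy dog."]
attacked = ["A fast brown fox leaps over a sleepy dog."]

# BERTScore: contextual-embedding similarity between attacked and original text.
P, R, F1 = bert_score(attacked, original, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")

# Flesch Reading Ease: checks whether the attack changed text complexity.
for label, text in (("original", original[0]), ("attacked", attacked[0])):
    print(f"{label}: FRE = {textstat.flesch_reading_ease(text):.1f}")
```

A BERTScore F1 near 1.0 together with a similar Flesch Reading Ease index would indicate that an attack preserved both meaning and readability.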
Why it matters
This research is crucial for understanding the current limitations of LLM watermarking, which aims to promote responsible AI usage. By revealing vulnerabilities, it guides the development of more secure and robust watermarking techniques for future language models.
Original Abstract
In this paper, we investigate recent state-of-the-art schemes for watermarking large language model (LLM) outputs. These techniques are claimed to be robust, scalable, and production-grade, aimed at promoting responsible usage of LLMs. We analyse the effectiveness of these watermarking techniques against an extensive collection of modified-text attacks, which perform targeted semantic changes without altering the general meaning of the text. Our approach encompasses multiple attack strategies, including lexical alterations, machine translation, and neural paraphrasing. Attack efficacy is measured against two target criteria: successful removal of the watermark and preservation of semantic content. We evaluate semantic preservation through BERTScore, text complexity measures, grammatical error counts, and Flesch Reading Ease indices. The experimental results reveal varying levels of effectiveness across watermarking models, with the same underlying finding: the watermark can be removed with reasonable effort. This study sheds light on the strengths and weaknesses of existing LLM watermarking systems and suggests how future schemes should be constructed to improve their security.
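To make one of the attack strategies the abstract names concrete, here is a hedged sketch of a machine-translation round trip (English to French and back) using off-the-shelf MarianMT models from Hugging Face; the model names and the `watermarked_texts` input are illustrative assumptions, not the authors' setup:

```python
# pip install transformers sentencepiece torch
from transformers import MarianMTModel, MarianTokenizer

def translate(texts, model_name):
    """Translate a batch of texts with a pretrained MarianMT model."""
    tok = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return [tok.decode(g, skip_special_tokens=True) for g in generated]

# Hypothetical watermarked LLM output to attack.
watermarked_texts = ["Large language models can embed statistical watermarks in their outputs."]

# Round-trip translation (en -> fr -> en) perturbs token choices,
# which is the signal that statistical watermark detectors rely on.
french = translate(watermarked_texts, "Helsinki-NLP/opus-mt-en-fr")
attacked = translate(french, "Helsinki-NLP/opus-mt-fr-en")
print(attacked[0])
```

Running a watermark detector on `attacked` versus the original text would quantify removal; the metrics sketch above covers the semantic-preservation side of the evaluation.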