RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience
Hanbo Huang, Xuan Gong, Yiran Zhang, Hao Zheng, Shiyu Liang
TLDR
RLSpoofer is a lightweight, black-box, RL-based spoofing attack that exposes the fragility of LLM watermarking, reaching a 62% spoof success rate from only 100 training pairs.
Key contributions
- RLSpoofer is a lightweight, black-box reinforcement-learning attack for evaluating the spoofing resilience of LLM watermarks (a hedged training sketch follows this list).
- It requires only 100 human-watermarked paraphrase pairs and no access to watermarking internals or detectors.
- A 4B model trained this way achieves a 62% spoof success rate with minimal semantic shift, roughly 10x the 6% of baselines trained on up to 10,000 samples.
- The results expose the fragile spoofing resistance of current LLM watermarking paradigms.
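The bullets above describe the training recipe only at a high level. As a rough, hypothetical illustration of what such an RL paraphrase-spoofing loop can look like, here is a minimal REINFORCE-style sketch; the base model, prompt format, reward function, and hyperparameters are all our assumptions, not details from the paper:

```python
# Hypothetical REINFORCE-style paraphrase-spoofing loop in the spirit of
# RLSpoofer. NOT the authors' code: model, reward, and hyperparameters
# are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # small stand-in for the paper's 4B model
tok = AutoTokenizer.from_pretrained(MODEL)
policy = AutoModelForCausalLM.from_pretrained(MODEL)
opt = torch.optim.AdamW(policy.parameters(), lr=1e-6)

def reward(paraphrase: str, watermarked_ref: str) -> float:
    """Placeholder reward: lexical overlap with a watermarked reference from
    the ~100-pair training set. The paper's actual reward is not known here."""
    a = set(paraphrase.lower().split())
    b = set(watermarked_ref.lower().split())
    return len(a & b) / max(len(a | b), 1)  # Jaccard similarity as a proxy

def reinforce_step(human_text: str, watermarked_ref: str) -> float:
    prompt = f"Paraphrase the following text.\nText: {human_text}\nParaphrase:"
    enc = tok(prompt, return_tensors="pt")
    with torch.no_grad():  # sample a paraphrase from the current policy
        seq = policy.generate(**enc, do_sample=True, top_p=0.9,
                              max_new_tokens=64,
                              pad_token_id=tok.eos_token_id)
    text = tok.decode(seq[0, enc.input_ids.shape[1]:], skip_special_tokens=True)
    # Teacher-forced forward pass WITH gradients to score the sampled tokens.
    logits = policy(seq).logits[:, :-1]
    logp = torch.log_softmax(logits, -1).gather(
        -1, seq[:, 1:].unsqueeze(-1)).squeeze(-1)
    cont_logp = logp[:, enc.input_ids.shape[1] - 1:].sum()  # continuation only
    r = reward(text, watermarked_ref)
    loss = -r * cont_logp  # REINFORCE: reward-weighted log-likelihood
    opt.zero_grad(); loss.backward(); opt.step()
    return r
```

In the paper's black-box setting the policy never queries a detector, so the 100 watermarked paraphrase references are the only supervision; the overlap-based reward above crudely mimics that kind of weak signal.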
Why it matters
This paper provides a lightweight, practical way to stress-test LLM watermark robustness. By achieving a high spoof success rate with only 100 training pairs and no access to watermarking internals, it exposes a significant vulnerability in current watermarking schemes and underscores the urgent need for more resilient AI-generated-text detection.
Original Abstract
Large language model (LLM) watermarking has emerged as a promising approach for detecting and attributing AI-generated text, yet its robustness to black-box spoofing remains insufficiently evaluated. Existing evaluation methods often demand extensive datasets and white-box access to algorithmic internals, limiting their practical applicability. In this paper, we study watermark resilience against spoofing fundamentally from a distributional perspective. We first establish a *local capacity bottleneck*, which theoretically characterizes the probability mass that can be reallocated under KL-bounded local updates while preserving semantic fidelity. Building on this, we propose RLSpoofer, a reinforcement learning-based black-box spoofing attack that requires only 100 human-watermarked paraphrase training pairs and zero access to the watermarking internals or detectors. Despite weak supervision, it empowers a 4B model to achieve a 62.0% spoof success rate with minimal semantic shift on PF-marked texts, dwarfing the 6% of baseline models trained on up to 10,000 samples. Our findings expose the fragile spoofing resistance of current LLM watermarking paradigms, providing a lightweight evaluation framework and stressing the urgent need for more robust schemes.
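The "local capacity bottleneck" concerns how much probability mass a KL-bounded edit can shift onto watermark-favored tokens. The paper's exact statement is not reproduced here; as a hedged illustration of how such a bound can arise, Pinsker's inequality already caps the shift for any event $G$ (say, "token falls in the watermark's green list") when a local update $q$ must stay within KL budget $\epsilon$ of the original distribution $p$:

$$
|q(G) - p(G)| \;\le\; \mathrm{TV}(q, p) \;\le\; \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}(q \,\|\, p)} \;\le\; \sqrt{\epsilon/2}.
$$

Tighter, update-specific bounds in this style are presumably what the paper formalizes; the symbols $p$, $q$, $G$, and $\epsilon$ above are our notation, not the authors'.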