Fight Poison with Poison: Enhancing Robustness in Few-shot Machine-Generated Text Detection with Adversarial Training
Wenjing Duan, Qi Zhou, Yuanfan Li
TLDR
REACT uses adversarial training and a RAG-guided attacker to boost few-shot MGT detection robustness against humanizing attacks.
Key contributions
- Proposes REACT, an adversarial training framework for robust few-shot MGT detection.
- Employs a RAG-guided attacker to craft highly human-like adversarial examples.
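- Trains the detector on these adversarial examples with a contrastive objective, stabilizing few-shot representation learning and enhancing robustness.
- Improves average detection F1 by 4.95 points over 8 SOTA detectors and reduces the average attack success rate (ASR) under 4 strong attacks by 3.66 percentage points.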
Why it matters
Machine-generated text (MGT) detection is critical for regulating online information ecosystems, yet existing detectors underperform when labeled data is scarce and remain vulnerable to humanizing adversarial attacks. REACT addresses both problems by alternately updating an attacker and a detector so that they co-evolve, which improves few-shot detection F1 and lowers the attack success rate.
Original Abstract
Machine-generated text (MGT) detection is critical for regulating online information ecosystems, yet existing detectors often underperform in few-shot settings and remain vulnerable to adversarial, humanizing attacks. To build accurate and robust detectors under limited supervision, we adopt a threat-modeling perspective and study detector vulnerabilities from an attacker's viewpoint under an output-only black-box setting. Motivated by this perspective, we propose RAG-GuidEd Attacker Strengthens ConTrastive Few-shot Detector (REACT), an adversarial training framework that improves both few-shot detection performance and robustness against attacks. REACT couples a humanization-oriented attacker with a target detector: the attacker leverages retrieval-augmented generation (RAG) to craft highly human-like adversarial examples to evade detection, while the detector learns from these adversaries with a contrastive objective to stabilize few-shot representation learning and enhance robustness. We alternately update the attacker and the detector to enable their co-evolution. Experiments on 4 datasets with 4 shot sizes and 3 random seeds show that REACT improves average detection F1 by 4.95 points over 8 state-of-the-art (SOTA) detectors and reduces the average attack success rate (ASR) under 4 strong attacks by 3.66 percentage points.
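To make the robustness metric concrete, here is a minimal sketch of how an attack success rate could be computed. The abstract does not give the paper's exact ASR definition, so this sketch assumes a common one: among machine-generated texts the detector flags correctly before the attack, the fraction it labels as human after the attack. The function name `attack_success_rate` and the label convention (1 = machine-generated, 0 = human) are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def attack_success_rate(pred_before: Sequence[int], pred_after: Sequence[int]) -> float:
    """Assumed ASR definition (not confirmed by the paper).

    Both sequences hold detector predictions for the same machine-generated
    texts: 1 = flagged as machine-generated, 0 = judged human.
    """
    if len(pred_before) != len(pred_after):
        raise ValueError("prediction lists must have the same length")

    # Only texts the detector caught before the attack can be "successfully" attacked.
    detected = [i for i, p in enumerate(pred_before) if p == 1]
    if not detected:
        return 0.0

    # An attack succeeds when a previously detected text now passes as human.
    evaded = sum(1 for i in detected if pred_after[i] == 0)
    return evaded / len(detected)


# Hypothetical predictions for four machine-generated texts.
before = [1, 1, 1, 0]
after = [0, 1, 1, 0]
asr = attack_success_rate(before, after)  # 1 of 3 detected texts evades
```

Under this definition, a reduction of 3.66 percentage points is an absolute difference between two ASR values expressed in percent, not a relative change of 3.66%.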