Fight Poison with Poison: Enhancing Robustness in Few-shot Machine-Generated Text Detection with Adversarial Training

May 4, 2026 | arXiv:2605.02374

cs.CR, cs.CL

TLDR

REACT uses adversarial training and a RAG-guided attacker to boost few-shot MGT detection robustness against humanizing attacks.

Key contributions

Proposes REACT, an adversarial training framework for robust few-shot MGT detection.
Employs a RAG-guided attacker to craft highly human-like adversarial examples.
Detector learns from adversaries via a contrastive objective, enhancing robustness.
Improves average detection F1 by 4.95 points over 8 SOTA detectors and reduces the average attack success rate (ASR) under 4 strong attacks by 3.66 percentage points.
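
The abstract reports the ASR reduction in percentage points, which is an absolute difference between two rates rather than a relative change. The sketch below illustrates the distinction. It assumes a common definition of ASR (the share of machine-generated texts the detector caught before the attack that evade it afterwards); this summary does not state the paper's exact definition, and the function names and toy numbers are hypothetical.

```python
def attack_success_rate(flagged_before, flagged_after):
    """Share of initially detected texts that evade the detector after an attack.

    Both arguments are lists of booleans, one entry per machine-generated text:
    True means the detector flagged that text as machine-generated.
    Assumed definition, not taken from the paper.
    """
    # Only texts that were caught before the attack can be "successfully attacked".
    caught = [i for i, flagged in enumerate(flagged_before) if flagged]
    if not caught:
        return 0.0
    evaded = sum(1 for i in caught if not flagged_after[i])
    return evaded / len(caught)


def percentage_point_drop(asr_baseline, asr_defended):
    """Absolute difference between two rates, expressed in percentage points."""
    return (asr_baseline - asr_defended) * 100


def relative_drop_percent(asr_baseline, asr_defended):
    """Relative change between two rates, expressed in percent."""
    return (asr_baseline - asr_defended) / asr_baseline * 100


# Toy data: 4 of 5 texts are caught before the attack, 2 of those 4 evade afterwards.
before = [True, True, True, True, False]
after = [True, False, True, False, False]
asr = attack_success_rate(before, after)

# Hypothetical rates: falling from 0.50 to 0.25 is 25 points, but a 50% relative drop.
points = percentage_point_drop(0.50, 0.25)
relative = relative_drop_percent(0.50, 0.25)
```

With these toy inputs the two measures differ by a factor of two, which is why the unit matters when reading the reported 3.66-point figure.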

Why it matters

Machine-generated text (MGT) detection is critical for regulating online information ecosystems, yet existing detectors often underperform when labeled data is scarce and remain vulnerable to humanizing attacks. REACT addresses both problems by co-evolving an attacker and a detector, which raises average detection F1 and lowers the average attack success rate in the reported experiments.

Original Abstract

Machine-generated text (MGT) detection is critical for regulating online information ecosystems, yet existing detectors often underperform in few-shot settings and remain vulnerable to adversarial, humanizing attacks. To build accurate and robust detectors under limited supervision, we adopt a threat-modeling perspective and study detector vulnerabilities from an attacker's viewpoint under an output-only black-box setting. Motivated by this perspective, we propose RAG-GuidEd Attacker Strengthens ConTrastive Few-shot Detector (REACT), an adversarial training framework that improves both few-shot detection performance and robustness against attacks. REACT couples a humanization-oriented attacker with a target detector: the attacker leverages retrieval-augmented generation (RAG) to craft highly human-like adversarial examples to evade detection, while the detector learns from these adversaries with a contrastive objective to stabilize few-shot representation learning and enhance robustness. We alternately update the attacker and the detector to enable their co-evolution. Experiments on 4 datasets with 4 shot sizes and 3 random seeds show that REACT improves average detection F1 by 4.95 points over 8 state-of-the-art (SOTA) detectors and reduces the average attack success rate (ASR) under 4 strong attacks by 3.66 percentage points.
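
The abstract says the detector learns from adversarial examples with a contrastive objective but does not give the loss. As a minimal sketch, the code below uses a generic supervised contrastive loss on toy 2-D embeddings: samples with the same label are pulled together and samples with different labels are pushed apart. The choice of loss, the temperature value, and the embeddings are all assumptions made for illustration, not the authors' formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (assumed form, not from the paper).

    For each anchor, every other sample with the same label is a positive;
    all other samples form the softmax denominator.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        log_denominator = math.log(sum(math.exp(v) for v in logits.values()))
        # Average negative log-probability of picking each positive.
        total += -sum(logits[j] - log_denominator for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy embeddings: two near the x-axis, two near the y-axis.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

# Labels that match the clusters (0 = human, 1 = machine) give a low loss.
loss_aligned = supervised_contrastive_loss(toy_embeddings, [0, 0, 1, 1])

# Labels that cut across the clusters give a high loss.
loss_misaligned = supervised_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

In an adversarial training setup of the kind the abstract describes, a humanized machine text would keep its machine label, so a loss of this shape would penalize embeddings that place it near the human cluster.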
