Beating the Style Detector: Three Hours of Agentic Research on the AI-Text Arms Race
Andreas Maier, Moritz Zaiss, Siming Bayer
TL;DR
An agentic AI harness reproduced a full NLP study in hours and showed that frontier LLMs can efficiently lower their own AI-detection probability.
Key contributions
- Agentic AI reproduced a full NLP study in hours, including new experiments, with human oversight.
- LLMs (GPT-5.5, Claude Opus 4.7) closed 71-75% of the style gap to the same-author ceiling, versus 24% for human post-editors.
- AI-text detectors achieved high AUC (0.93-1.00), but detection mechanisms varied by LLM.
- An Opus agent, given 20 feedback iterations against a frozen detector, learned to sharply reduce its AI-detection probability.
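The detector-feedback loop in the last bullet can be sketched as: propose a rewrite, query the frozen detector for its signed margin, and keep whichever draft scores least AI-like. This is an illustrative toy, not the paper's code; the `rewrite` and `detector_margin` functions here are stand-ins (a keyword counter and a keyword remover) chosen only to make the loop runnable.

```python
def evade(draft, rewrite, detector_margin, T=20):
    """Iteratively rewrite `draft`, keeping the version the frozen
    detector scores lowest (negative margin = human half-space)."""
    best, best_m = draft, detector_margin(draft)
    for _ in range(T):
        cand = rewrite(best, best_m)   # agent proposes a revision given feedback
        m = detector_margin(cand)      # frozen detector's signed margin
        if m < best_m:                 # keep only improvements
            best, best_m = cand, m
    return best, best_m

# Toy stand-ins (assumptions, not the paper's detector or agent):
def detector_margin(text):             # margin grows with a telltale token count
    return text.count("delve") - 1

def rewrite(text, margin):             # drop one telltale token per step
    return text.replace("delve", "", 1)

final, margin = evade("delve delve delve", rewrite, detector_margin, T=5)
print(margin)  # -1 (crossed into the human half-space)
```

The paper's setting replaces the stand-ins with an LLM agent proposing full rewrites and a trained SVM supplying the margin, but the control flow is the same hill-climb against a fixed oracle.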
Why it matters
This paper showcases the power of agentic AI to accelerate empirical research, reducing study reproduction time from weeks to hours. Crucially, it reveals that advanced LLMs can actively learn to evade AI-text detectors, escalating the AI-text arms race. This has profound implications for content authenticity and the future of human-AI collaboration.
Original Abstract
Reproducing an empirical NLP study used to take weeks. Given the released data and a modern agentic-research harness, we redo every experiment of a recent ACL 2026 study on personal-style post-editing of LLM drafts (and add three new ones) with the human investigator acting only as a reviewer-in-the-loop. We reproduce all seven preregistered hypotheses and recover the paper's headline correlation between perceived self-similarity and embedding-measured self-similarity to three decimal places (r = +0.244, p < 10⁻⁸, n = 648). Under a leakage-free held-out protocol, GPT-5.5 and Claude Opus 4.7 close 71-75% of the style gap to the same-author ceiling on 324 paired tasks, against 24% for the human post-edit, and beat the human post-edit on ~80% of tasks. We then frame the same data as an AI-text detection arms race. A leave-authors-out linear SVM on LUAR-MUD embeddings reaches AUC 0.93-1.00 across approaches; six diagnostics show that GPT-5.5 detection is mostly a length confound while Opus detection is a genuine stylistic signature. Given T = 20 feedback iterations against the frozen detector, an Opus agent flips two of five held-out test mimics to the human half-space and shrinks every margin by an order of magnitude. With moderate effort against a known detector, a frontier LLM can already efficiently lower its own AI-detection probability. All code, 648 mimic drafts, trained detectors, diagnostics, and adversarial trajectories are released.
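The abstract's detection protocol, a linear SVM evaluated leave-authors-out, maps directly onto scikit-learn. The sketch below is an assumption-laden illustration, not the paper's pipeline: random vectors stand in for LUAR-MUD embeddings, and a small mean shift gives the toy "AI" class a detectable signature so the evaluation has signal.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)

# Stand-in features: random vectors in place of LUAR-MUD embeddings (assumption).
n_authors, docs_per_author, dim = 10, 12, 64
X, y, groups = [], [], []
for author in range(n_authors):
    for i in range(docs_per_author):
        is_ai = i % 2  # half human drafts, half LLM mimics per author
        # Small mean shift so the toy AI class is separable.
        X.append(rng.normal(size=dim) + (0.8 if is_ai else 0.0))
        y.append(is_ai)
        groups.append(author)
X, y, groups = np.array(X), np.array(y), np.array(groups)

# Leave-authors-out: every test fold holds out all documents of one author,
# so the detector cannot exploit author identity.
aucs = []
for tr, te in LeaveOneGroupOut().split(X, y, groups):
    clf = LinearSVC(C=1.0).fit(X[tr], y[tr])
    scores = clf.decision_function(X[te])  # signed margin = detector confidence
    aucs.append(roc_auc_score(y[te], scores))
print(f"leave-authors-out AUC: {np.mean(aucs):.2f}")
```

The same signed margins from `decision_function` are what the adversarial agent in the abstract receives as feedback; shrinking them toward (and past) zero moves a mimic into the human half-space.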