Please Make it Sound like Human: Encoder-Decoder vs. Decoder-Only Transformers for AI-to-Human Text Style Transfer
TLDR
This paper explores rewriting AI-generated text to sound human, finding that BART-large outperforms Mistral-7B at style transfer despite having far fewer parameters.
Key contributions
- Created a parallel corpus of 25,140 AI-to-human text pairs for style transfer.
- Identified 11 measurable stylistic markers distinguishing AI from human text.
- BART-large achieved superior human-like style transfer with 17x fewer parameters than Mistral-7B.
- Highlighted 'shift accuracy' as a crucial, overlooked metric in style transfer evaluation.
Why it matters
This paper tackles the less-studied reverse of AI-text detection: making AI-generated text read as genuinely human. It contributes a new parallel dataset and benchmarks three models, showing that a smaller encoder-decoder model can outperform a much larger decoder-only one. The work also exposes a blind spot in how current style transfer systems are evaluated.
Original Abstract
AI-generated text has become common in academic and professional writing, prompting research into detection methods. Less studied is the reverse: systematically rewriting AI-generated prose to read as genuinely human-authored. We build a parallel corpus of 25,140 paired AI-input and human-reference text chunks, identify 11 measurable stylistic markers separating the two registers, and fine-tune three models: BART-base, BART-large, and Mistral-7B-Instruct with QLoRA. BART-large achieves the highest reference similarity -- BERTScore F1 of 0.924, ROUGE-L of 0.566, and chrF++ of 55.92 -- with 17x fewer parameters than Mistral-7B. We show that Mistral-7B's higher marker shift score reflects overshoot rather than accuracy, and argue that shift accuracy is a meaningful blind spot in current style transfer evaluation.
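The abstract's key evaluation point is that a raw "marker shift" score rewards any movement of a stylistic marker away from the AI baseline, even past the human reference, whereas "shift accuracy" rewards landing close to the reference. The sketch below illustrates that distinction with hypothetical formulas and toy numbers; the exact definitions, marker values, and model outputs are assumptions for illustration, not the paper's actual metrics or results.

```python
def marker_shift(ai_val, out_val, human_val):
    """Raw shift: sign-corrected movement of a stylistic marker away
    from the AI baseline toward the human reference. Rewards any
    movement, including overshooting past the human value.
    (Illustrative formula, not the paper's exact definition.)"""
    direction = 1 if human_val >= ai_val else -1
    return direction * (out_val - ai_val)

def shift_accuracy(ai_val, out_val, human_val):
    """Closeness of the output's marker value to the human reference,
    normalized by the original AI-to-human gap. 1.0 means landing
    exactly on the reference; overshoot is penalized like undershoot.
    (Illustrative formula, not the paper's exact definition.)"""
    gap = abs(human_val - ai_val)
    if gap == 0:
        return 1.0
    return max(0.0, 1.0 - abs(human_val - out_val) / gap)

# Toy example with one marker, e.g. average sentence length:
ai, human = 28.0, 18.0     # hypothetical: AI text is wordier than human
near, overshoot = 19.0, 12.0  # one output lands near, one overshoots

# Raw shift scores the overshooting output higher...
assert marker_shift(ai, overshoot, human) > marker_shift(ai, near, human)
# ...while shift accuracy prefers the output that lands near the reference.
assert shift_accuracy(ai, near, human) > shift_accuracy(ai, overshoot, human)
```

This makes the abstract's claim concrete: an output that drives a marker past the human reference can post a larger raw shift than one that stops at the right value, which is exactly the overshoot effect attributed to Mistral-7B.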