Measuring and Mitigating Persona Distortions from AI Writing Assistance
Paul Röttger, Kobi Hackenburg, Hannah Rose Kirk, Christopher Summerfield
TLDR
AI writing assistance distorts writer personas, making writers seem more opinionated, competent, and demographically privileged, even though writers prefer the AI-assisted text over their own.
Key contributions
- Large-scale experiments (N=2,939 writers, N=11,091 readers) show AI distorts writer personas.
- AI-assisted writers appear more opinionated, competent, positive, and from more privileged demographics.
- Writers object to observed distortions but still prefer AI-generated text over their own.
- Mitigation via reward models reduces distortions but lowers user acceptance, revealing a trade-off.
Why it matters
This paper shows that AI writing tools substantially alter how writers are perceived, shifting their perceived beliefs, personality, and identity. These distortions persist even when writers are made aware of them, which carries critical implications for public discourse, trust, and democratic deliberation as AI adoption grows. The findings highlight a difficult challenge in balancing AI utility with faithful representation of the writer.
Original Abstract
Hundreds of millions of people use artificial intelligence (AI) for writing assistance. Here, we evaluated how AI writing assistance distorts writer personas - their perceived beliefs, personality, and identity. In three large-scale experiments, writers (N=2,939) wrote political opinion paragraphs with and without AI assistance. Separate groups of readers (N=11,091) blindly evaluated these paragraphs across 29 socially salient dimensions of reader perception, spanning political opinion, writing quality, writer personality, emotions, and demographics. AI writing assistance produced persona distortions across all dimensions: with AI, writers seemed more opinionated, competent, and positive, and their perceived demographic profile shifted towards more privileged groups. Writers objected to many of the observed distortions, yet continued to prefer AI-assisted text even when made aware of them. We successfully mitigated objectionable persona distortions at the model level by training reward models on our experimental data (10,008 paragraphs, 2,903,596 ratings) to steer AI outputs towards faithful representation of writer stance. However, this came at a cost to user acceptance, suggesting an entanglement between desirable and undesirable properties of AI writing assistance that may be difficult to resolve. Together, our findings demonstrate that persona distortions from AI writing assistance are pervasive and persistent even under realistic conditions of human oversight, which carries implications for public discourse, trust, and democratic deliberation that scale with AI adoption.