ArXiv TLDR

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

2605.13596

Kyo Gerrits, Rik van Noord, Ana Guerberof Arenas

cs.CL

TLDR

Automatic metrics and LLM judges evaluate creativity in literary translations poorly, often penalizing creative solutions and showing a bias toward machine output.

Key contributions

  • AEMs and LLM-as-a-judge poorly align with professional evaluations of creativity in literary translation.
  • LLM-as-a-judge shows a systematic bias, favoring machine translations and penalizing creative, culturally appropriate solutions.
  • Automatic evaluation tools perform worse on highly literary genres like poetry, highlighting fundamental limitations.

Why it matters

This paper exposes fundamental limitations of current automatic evaluation tools, including LLMs, in assessing creativity in literary translation. It highlights the urgent need for new metrics that value creative linguistic solutions, rather than penalizing them, to advance human-aligned translation technology.

Original Abstract

This article investigates the performance of automatic evaluation metrics (AEMs) and LLM-as-a-judge evaluation on literary translation across multiple languages, genres, and translation modalities. The aim is to assess how well these tools align with professionals when evaluating translation quality and creativity (creative shifts & errors), and to see whether they can substitute for laborious manual annotations. A dataset of literary translations across three modalities (human translation, machine translation, and post-editing), three genres, and three language pairs was created and annotated in detail for creativity by experienced professional literary translators. The results show that both AEMs and LLM-as-a-judge evaluations correlate poorly with professional evaluations of creativity, with LLM-as-a-judge showing a systematic bias in favour of machine-translated texts and penalising creative and culturally appropriate solutions. Moreover, performance is consistently worse for more literary genres such as poetry. This highlights fundamental limitations of current automatic evaluation tools for literary translation and the need to create new tools that do not routinely treat out-of-routine translations as errors.
