LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation
Samar M. Magdy, Fakhraddin Alwajih, Abdellah El Mekki, Wesam El-Sayed, Muhammad Abdul-Mageed
TLDR
LQM is a linguistically motivated, hierarchical error taxonomy for MT evaluation; validated on seven Arabic dialects, it addresses the limitations of language-agnostic metrics.
Key contributions
- Introduces LQM, a linguistically motivated, hierarchical error taxonomy for MT evaluation.
- Constructs a 3,850-sentence bidirectional parallel corpus (550 sentences per variety) spanning seven Arabic dialects.
- Conducts expert span-level human annotation, identifying 6,113 error spans across 3,495 erroneous sentences, along with severity-weighted quality scores (see the sketch after this list).
- Releases LQM as a language-agnostic framework, validated on Arabic and publicly available for adaptation to other languages.
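The six-level taxonomy and severity-weighted scores lend themselves to a simple span-annotation representation. The sketch below is illustrative, not the paper's implementation: the `ErrorSpan` and `quality_score` helpers and the severity weights are assumptions following common MQM practice (minor=1, major=5, critical=10).

```python
from dataclasses import dataclass

# The six linguistic levels named in the LQM taxonomy.
LEVELS = (
    "sociolinguistics", "pragmatics", "semantics",
    "morphosyntax", "orthography", "graphetics",
)

# Hypothetical severity weights; the paper's exact weighting is not given here.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

@dataclass
class ErrorSpan:
    start: int     # character offset of the span in the MT output
    end: int
    level: str     # one of LEVELS
    severity: str  # one of SEVERITY_WEIGHTS

def quality_score(spans: list[ErrorSpan], n_words: int) -> float:
    """MQM-style severity-weighted score: 100 means no annotated errors."""
    penalty = sum(SEVERITY_WEIGHTS[s.severity] for s in spans)
    return max(0.0, 100.0 * (1 - penalty / n_words))

# Example: two annotated errors in a 12-word hypothesis.
spans = [
    ErrorSpan(0, 8, "sociolinguistics", "major"),  # e.g., wrong dialect register
    ErrorSpan(20, 27, "orthography", "minor"),
]
print(quality_score(spans, n_words=12))  # 50.0
```

Under this formulation, error-free output scores 100, and major or critical spans pull the score down faster than minor ones.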
Why it matters
Existing MT evaluation often misses dialect- and culture-specific errors in diglossic languages such as Arabic. LQM addresses this with a linguistically grounded, hierarchical error taxonomy, offering a more accurate diagnostic tool for improving MT quality in diverse linguistic contexts; the framework is publicly available for broader adoption.
Original Abstract
Existing MT evaluation frameworks, including automatic metrics and human evaluation schemes such as Multidimensional Quality Metrics (MQM), are largely language-agnostic. However, they often fail to capture dialect- and culture-specific errors in diglossic languages (e.g., Arabic), where translation failures stem from mismatches in language variety, content coverage, and pragmatic appropriateness rather than surface form alone. We introduce LQM: Linguistically Motivated Multidimensional Quality Metrics for MT. LQM is a hierarchical error taxonomy for diagnosing MT errors through six linguistically grounded levels: sociolinguistics, pragmatics, semantics, morphosyntax, orthography, and graphetics. We construct a bidirectional parallel corpus of 3,850 sentences (550 per variety) spanning seven Arabic dialects (Egyptian, Emirati, Jordanian, Mauritanian, Moroccan, Palestinian, and Yemeni), derived from conversational, culturally rich content. We evaluate six LLMs in a zero-shot setting and conduct expert span-level human annotation using LQM, producing 6,113 labeled error spans across 3,495 unique erroneous sentences, along with severity-weighted quality scores. We complement this analysis with an automatic metric (spBLEU). Though validated here on Arabic, LQM is a language-agnostic framework designed to be easily applied to or adapted for other languages. LQM-annotated error data, prompts, and annotation guidelines are publicly available at https://github.com/UBC-NLP/LQM_MT.
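The abstract names spBLEU as the complementary automatic metric. A minimal sketch of computing spBLEU with the sacrebleu library, under the assumption of default settings (the paper's exact configuration is not stated here; the tokenizer is named `flores200` in recent sacrebleu releases and `spm`/`flores101` in older ones):

```python
# pip install sacrebleu  (spBLEU requires sacrebleu >= 2.x; the SentencePiece
# model is downloaded automatically on first use)
from sacrebleu.metrics import BLEU

hypotheses = ["the cat sat on the mat"]    # MT outputs, one string per sentence
references = [["the cat sat on the mat"]]  # one aligned list per reference set

# spBLEU = BLEU over SentencePiece subwords, making scores comparable
# across languages without language-specific tokenizers.
spbleu = BLEU(tokenize="flores200")
print(spbleu.corpus_score(hypotheses, references))
```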