LQM: Linguistically Motivated Multidimensional Quality Metrics for Machine Translation
Samar M. Magdy, Fakhraddin Alwajih, Abdellah El Mekki, Wesam El-Sayed, Muhammad Abdul-Mageed
TLDR
LQM is a linguistically motivated, hierarchical error taxonomy for MT evaluation; validated on seven Arabic dialects, it addresses the limitations of language-agnostic metrics.
Key contributions
- Introduces LQM, a linguistically motivated, hierarchical error taxonomy for MT evaluation.
- Constructs a 3,850-sentence bidirectional parallel corpus (550 sentences per variety) spanning seven Arabic dialects.
- Conducts expert span-level human annotation, identifying 6,113 error spans across 3,495 erroneous sentences, along with severity-weighted quality scores (see the sketch after this list).
- Releases LQM as a language-agnostic framework, validated on Arabic and publicly available for adaptation to other languages.
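The six-level taxonomy and severity-weighted scores lend themselves to a simple span-annotation representation. The sketch below is illustrative, not the paper's implementation: the `ErrorSpan` and `quality_score` helpers and the severity weights are assumptions following common MQM practice (minor=1, major=5, critical=10).

```python
from dataclasses import dataclass

# The six linguistic levels named in the LQM taxonomy.
LEVELS = (
    "sociolinguistics", "pragmatics", "semantics",
    "morphosyntax", "orthography", "graphetics",
)

# Hypothetical severity weights; the paper's exact weighting is not given here.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}

@dataclass
class ErrorSpan:
    start: int     # character offset of the span in the MT output
    end: int
    level: str     # one of LEVELS
    severity: str  # one of SEVERITY_WEIGHTS

def quality_score(spans: list[ErrorSpan], n_words: int) -> float:
    """MQM-style severity-weighted score: 100 means no annotated errors."""
    penalty = sum(SEVERITY_WEIGHTS[s.severity] for s in spans)
    return max(0.0, 100.0 * (1 - penalty / n_words))

# Example: two annotated errors in a 12-word hypothesis.
spans = [
    ErrorSpan(0, 8, "sociolinguistics", "major"),  # e.g., wrong dialect register
    ErrorSpan(20, 27, "orthography", "minor"),
]
print(quality_score(spans, n_words=12))  # 50.0
```

Under this formulation, error-free output scores 100, and major or critical spans pull the score down faster than minor ones.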
Why it matters
Existing MT evaluation often misses dialect- and culture-specific errors in diglossic languages such as Arabic. LQM addresses this with a linguistically grounded, hierarchical error taxonomy, offering a more accurate diagnostic tool for improving MT quality in diverse linguistic contexts; the framework is publicly available for broader adoption.
Original Abstract
Existing MT evaluation frameworks, including automatic metrics and human evaluation schemes such as Multidimensional Quality Metrics (MQM), are largely language-agnostic. However, they often fail to capture dialect- and culture-specific errors in diglossic languages (e.g., Arabic), where translation failures stem from mismatches in language variety, content coverage, and pragmatic appropriateness rather than surface form alone. We introduce LQM: Linguistically Motivated Multidimensional Quality Metrics for MT. LQM is a hierarchical error taxonomy for diagnosing MT errors through six linguistically grounded levels: sociolinguistics, pragmatics, semantics, morphosyntax, orthography, and graphetics. We construct a bidirectional parallel corpus of 3,850 sentences (550 per variety) spanning seven Arabic dialects (Egyptian, Emirati, Jordanian, Mauritanian, Moroccan, Palestinian, and Yemeni), derived from conversational, culturally rich content. We evaluate six LLMs in a zero-shot setting and conduct expert span-level human annotation using LQM, producing 6,113 labeled error spans across 3,495 unique erroneous sentences, along with severity-weighted quality scores. We complement this analysis with an automatic metric (spBLEU). Though validated here on Arabic, LQM is a language-agnostic framework designed to be easily applied to or adapted for other languages. LQM-annotated error data, prompts, and annotation guidelines are publicly available at https://github.com/UBC-NLP/LQM_MT.
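The abstract names spBLEU as the complementary automatic metric. A minimal sketch of computing spBLEU with the sacrebleu library, under the assumption of default settings (the paper's exact configuration is not stated here; the tokenizer is named `flores200` in recent sacrebleu releases and `spm`/`flores101` in older ones):

```python
# pip install sacrebleu  (spBLEU requires sacrebleu >= 2.x; the SentencePiece
# model is downloaded automatically on first use)
from sacrebleu.metrics import BLEU

hypotheses = ["the cat sat on the mat"]    # MT outputs, one string per sentence
references = [["the cat sat on the mat"]]  # one aligned list per reference set

# spBLEU = BLEU over SentencePiece subwords, making scores comparable
# across languages without language-specific tokenizers.
spbleu = BLEU(tokenize="flores200")
print(spbleu.corpus_score(hypotheses, references))
```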