ArXiv TLDR

MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocol

2605.07249

Youngjoon Jang, Seongtae Hong, Hyeonseok Moon, Heuiseok Lim

cs.IR

TLDR

MLAIRE introduces a new protocol and metrics to evaluate multilingual information retrieval, focusing on both semantic relevance and user language preference.

Key contributions

  • Introduces MLAIRE, a protocol for language-aware multilingual IR evaluation.
  • Disentangles semantic retrieval from user query-language preference.
  • Proposes new metrics: Language Preference Rate (LPR) and Lang-nDCG.
  • Evaluates 31 retrievers, revealing distinct behaviors in language preference.

Why it matters

Current multilingual IR evaluations neglect user language preference. MLAIRE provides a protocol for assessing whether systems retrieve content in the query language, separating that ability from semantic relevance and exposing behaviors that standard relevance metrics obscure. This guides the development of more user-centric search.

Original Abstract

Multilingual Information Retrieval is increasingly important in real-world search settings, where users issue queries over mixed-language corpora. Existing evaluations mainly reward language-agnostic semantic relevance, treating relevant passages equally regardless of language. Yet retrieval utility also depends on the language of the retrieved passages: users may prefer results they can read and verify in the query language, and query--passage language mismatch can complicate downstream grounding and answer verification in Retrieval-Augmented Generation systems. To evaluate this language-aware dimension, we introduce MLAIRE, a Multilingual Language-Aware Information Retrieval Evaluation protocol that disentangles cross-lingual semantic retrieval from query-language preference. MLAIRE constructs controlled pools with parallel passages across languages, enabling measurement of semantic retrieval accuracy and query-language preference when equivalent translations are available. We propose language-aware metrics, including Language Preference Rate (LPR) and Lang-nDCG, together with a 4-way decomposition separating semantic and query-language preference failures. Evaluating 31 dense, sparse, and late-interaction retrievers, we show that standard metrics obscure distinct behaviors: semantically strong retrievers may return correct content in a non-query language, while retrievers with stronger query-language preference may retrieve less semantically relevant passages.
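To make the metric names above concrete, here is a minimal sketch of what language-aware evaluation could look like. The paper's exact formulas are not reproduced here; this assumes LPR is the fraction of relevant retrieved passages whose language matches the query, and Lang-nDCG is nDCG where a relevant passage in a non-query language receives a reduced gain (the `mismatch_discount` parameter is an illustrative assumption, not from the paper).

```python
import math

def lpr(results, query_lang):
    """Assumed Language Preference Rate: share of relevant retrieved
    passages that are in the query language.
    results: list of (is_relevant: bool, lang: str) in rank order."""
    relevant_langs = [lang for rel, lang in results if rel]
    if not relevant_langs:
        return 0.0
    return sum(lang == query_lang for lang in relevant_langs) / len(relevant_langs)

def lang_ndcg(results, query_lang, mismatch_discount=0.5, k=10):
    """Assumed Lang-nDCG@k: standard nDCG, but a relevant passage in a
    non-query language only earns a discounted gain."""
    def gain(rel, lang):
        if not rel:
            return 0.0
        return 1.0 if lang == query_lang else mismatch_discount

    # DCG over the actual ranking (log2 position discount, rank 1 -> log2(2)).
    dcg = sum(gain(rel, lang) / math.log2(i + 2)
              for i, (rel, lang) in enumerate(results[:k]))
    # Ideal DCG: best possible ordering of the same gains.
    ideal = sorted((gain(rel, lang) for rel, lang in results), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy ranking for a Korean query: two relevant Korean hits, one relevant
# English hit (semantically correct but language-mismatched).
ranked = [(True, "ko"), (True, "en"), (False, "en"), (True, "ko")]
print(round(lpr(ranked, "ko"), 3))        # → 0.667
print(round(lang_ndcg(ranked, "ko"), 3))
```

The toy example shows the decoupling the abstract describes: a retriever can score well on plain relevance while LPR reveals that a third of its relevant results are unreadable in the user's query language.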
