Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus
Daria Boratyn, Damian Brzyski, Albert Leśniak, Wojciech Łukasik, Maciej Rapacz + 3 more
TLDR
This paper examines whether textual similarity, measured as cosine similarity between paragraph embeddings, is preserved when text in 28 languages is machine-translated into English.
Key contributions
- Investigates invariance of cosine similarity of paragraph embeddings under machine translation.
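- Uses the Manifesto Corpus of over 2,800 political party platforms in 28 languages, translated to English via the EU eTranslation service.
- Proposes a per-language non-inferiority test that uses inter-model disagreement on original-language text as the invariance threshold.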
- Identifies 10 languages where translation preserves structure and 4 where it degrades it.
Why it matters
Machine translation is widely used in multilingual NLP, but its effect on the semantic structure that downstream tasks rely on is often unclear. This research offers a corpus- and pipeline-agnostic framework for testing, language by language, whether translation preserves that structure.
Original Abstract
We investigate the extent to which cosine similarity between paragraph embeddings is invariant under machine translation, using the Manifesto Corpus of over 2,800 political party platforms in 28 languages translated to English via the EU eTranslation service. Rather than measuring translation-induced semantic shift directly we measure the stability of pairwise similarity relationships across embedding models, and use inter-model disagreement on original-language text as a calibrated invariance threshold. This yields a per-language non-inferiority test for four hypotheses about how translation interacts with embedding choice, with verdicts that distinguish languages where translation demonstrably preserves semantic structure from those where it demonstrably degrades it and from those where the available evidence does not resolve the question. The framework is corpus- and pipeline-agnostic and extends naturally to downstream tasks. Applied to our data, it identifies ten languages with translation invariance and four with detectable distortion.
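The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of one minus the Pearson correlation as the disagreement measure, and the plain point comparison against the threshold are all assumptions made for illustration. The paper's actual statistic and test procedure are not given in the abstract, and a real non-inferiority test would account for sampling uncertainty.

```python
import numpy as np


def pairwise_similarities(embeddings: np.ndarray) -> np.ndarray:
    """Return the cosine similarity of every unordered pair of rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    sim = unit @ unit.T
    upper = np.triu_indices(sim.shape[0], k=1)  # skip diagonal and duplicates
    return sim[upper]


def disagreement(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Hypothetical disagreement score: 1 - Pearson correlation between the
    pairwise similarities of two embeddings of the same paragraphs."""
    sims_a = pairwise_similarities(emb_a)
    sims_b = pairwise_similarities(emb_b)
    return 1.0 - float(np.corrcoef(sims_a, sims_b)[0, 1])


def is_non_inferior(translation_disagreement: float, threshold: float) -> bool:
    """Assumed decision rule: translation counts as non-inferior when it
    distorts similarities no more than switching embedding models does."""
    return translation_disagreement <= threshold


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for paragraph embeddings (50 paragraphs).
    model_a_original = rng.normal(size=(50, 16))
    model_b_original = model_a_original + 0.5 * rng.normal(size=(50, 16))
    model_a_translated = model_a_original + 0.1 * rng.normal(size=(50, 16))

    # Threshold: how much two models disagree on the original-language text.
    threshold = disagreement(model_a_original, model_b_original)
    # Effect of translation under a fixed model.
    shift = disagreement(model_a_original, model_a_translated)
    print("threshold:", threshold)
    print("translation disagreement:", shift)
    print("non-inferior:", is_non_inferior(shift, threshold))
```

Because only pairwise similarities are compared, the two embedding models need not share a dimensionality; they only need to embed the same paragraphs in the same order.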