ArXiv TLDR

ORPHEAS: A Cross-Lingual Greek-English Embedding Model for Retrieval-Augmented Generation

arXiv: 2604.20666

Ioannis E. Livieris, Athanasios Koursaris, Alexandra Apostolopoulou, Konstantinos Kanaris, Dimitris Tsakalidis, George Domalis

cs.CL cs.AI

TLDR

ORPHEAS is a specialized Greek-English embedding model for bilingual RAG, outperforming state-of-the-art models in cross-lingual retrieval.

Key contributions

  • Introduces ORPHEAS, a specialized Greek-English embedding model for RAG.
  • Uses knowledge graph-based fine-tuning on a multi-domain corpus for training.
  • Achieves strong cross-lingual semantic alignment and domain-specific understanding.
  • Outperforms SOTA multilingual models in Greek-English retrieval tasks.
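The retrieval setting these contributions target can be sketched in a few lines: a bilingual embedding model maps a Greek query and candidate English passages into a shared vector space, and passages are ranked by cosine similarity. The `embed` stub and its toy vectors below are illustrative placeholders, not the ORPHEAS model or its API.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder for a bilingual embedding model such as ORPHEAS.
    # A real model would encode arbitrary text; here we return fixed
    # toy vectors so the ranking logic is runnable on its own.
    toy_vectors = {
        "Ποια είναι η πρωτεύουσα της Ελλάδας;":
            np.array([0.9, 0.1, 0.0]),
        "Athens is the capital and largest city of Greece.":
            np.array([0.8, 0.2, 0.1]),
        "The recipe calls for two cups of flour.":
            np.array([0.0, 0.1, 0.9]),
    }
    return toy_vectors[text]

def rank_passages(query: str, passages: list[str]) -> list[tuple[str, float]]:
    """Rank passages by cosine similarity to the query embedding."""
    q = embed(query)
    scored = []
    for p in passages:
        v = embed(p)
        cos = float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
        scored.append((p, cos))
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Cross-lingual retrieval: Greek query against English passages.
query = "Ποια είναι η πρωτεύουσα της Ελλάδας;"  # "What is the capital of Greece?"
passages = [
    "Athens is the capital and largest city of Greece.",
    "The recipe calls for two cups of flour.",
]
ranked = rank_passages(query, passages)
```

With a well-aligned bilingual embedding space, the topically matching English passage scores highest even though the query is in Greek; this is exactly the cross-lingual alignment the paper evaluates.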

Why it matters

Existing multilingual models often fail to capture Greek's morphological complexity and domain-specific terminology. ORPHEAS offers a solution optimized for Greek-English RAG, showing that specialized fine-tuning can substantially boost retrieval performance without compromising cross-lingual capability.

Original Abstract

Effective retrieval-augmented generation across bilingual Greek--English applications requires embedding models capable of capturing both domain-specific semantic relationships and cross-lingual semantic alignment. Existing multilingual embedding models distribute their representational capacity across numerous languages, limiting their optimization for Greek and failing to encode the morphological complexity and domain-specific terminological structures inherent in Greek text. In this work, we propose ORPHEAS, a specialized Greek--English embedding model for bilingual retrieval-augmented generation. ORPHEAS is trained with a high-quality dataset generated by a knowledge graph-based fine-tuning methodology applied to a diverse multi-domain corpus, which enables language-agnostic semantic representations. The numerical experiments across monolingual and cross-lingual retrieval benchmarks reveal that ORPHEAS outperforms state-of-the-art multilingual embedding models, demonstrating that domain-specialized fine-tuning on morphologically complex languages does not compromise cross-lingual retrieval capability.
