Granite Embedding Multilingual R2 Models
Parul Awasthy, Aashka Trivedi, Yushu Yang, Ken Barker, Yulong Li + 13 more
TLDR
Granite Embedding Multilingual R2 models offer state-of-the-art dense retrieval across 200+ languages with a 32,768-token context window.
Key contributions
- Introduces Granite Embedding Multilingual R2 models for dense retrieval across 200+ languages.
- Features a 32,768-token context window, a 64x expansion over the previous R1 models (32,768 / 64 = 512 tokens).
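- Includes two bi-encoder models: a 311M-parameter full-size model and a 97M-parameter compact model, which achieves the highest retrieval score of any open multilingual embedding model under 100M parameters.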
- Full-size model supports Matryoshka Representation Learning for flexible embedding dimensions.
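Matryoshka Representation Learning trains embeddings so that a leading prefix of the vector can stand in for the whole vector. The following is a minimal sketch of how such an embedding is typically shortened at inference time, not the authors' implementation: keep the first `dim` components and re-normalize. The function names and the 8-dimensional toy vectors are hypothetical, and no Granite model is loaded here.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components of `vec` and L2-normalize them."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors invented for illustration (assumption, not model output).
query = [0.9, 0.1, 0.3, 0.2, 0.05, 0.01, 0.02, 0.03]
doc = [0.8, 0.2, 0.25, 0.1, 0.0, 0.04, 0.01, 0.02]

full_score = cosine(query, doc)
short_score = cosine(truncate_embedding(query, 4), truncate_embedding(doc, 4))
print(f"full: {full_score:.4f}  truncated to 4 dims: {short_score:.4f}")
```

Shorter vectors reduce index size and similarity-search cost. How much retrieval quality is retained at each dimension depends on the model and should be checked against the paper's reported results.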
Why it matters
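These models target multilingual and cross-lingual retrieval for enterprise applications, with state-of-the-art overall results reported across text search, code retrieval, long-document search, and reasoning retrieval. The 32,768-token context window lets long documents be embedded with less chunking, and the full-size model's flexible embedding dimensionality lets users trade vector size against accuracy. Both models are released under Apache 2.0, enabling broad research and commercial adoption.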
Original Abstract
We introduce the multilingual Granite Embedding R2 models, a family of encoder-based embedding models for enterprise-scale dense retrieval across 200+ languages. Extending our English-focused R2 release, these models add enhanced support for 52 languages and programming code, a 32,768-token context window (a 64x expansion over R1), and state-of-the-art overall performance across multilingual and cross-lingual text search, code retrieval, long-document search, and reasoning retrieval datasets. The release consists of two bi-encoder models based on the ModernBERT architecture with an expanded multilingual vocabulary: a 311M-parameter full-size, and a 97M-parameter compact model built via model pruning and vocabulary selection that achieves the highest retrieval score of any open multilingual embedding model under 100M parameters. The full-size also supports Matryoshka Representation Learning for flexible embedding dimensionality. Both models are trained on enterprise-appropriate data with governance oversight, and released under the Apache 2.0 license at https://huggingface.co/collections/ibm-granite, designed to support responsible use and enable unrestricted research and enterprise adoption.
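The abstract describes both models as bi-encoders: queries and documents are embedded independently, and relevance is scored by comparing the resulting vectors. A minimal sketch of that scoring step follows, assuming the embeddings have already been computed. The function names and the toy 2-dimensional vectors are hypothetical, and cosine similarity is assumed as the comparison function, which is a common choice and not something the abstract specifies.

```python
import math


def normalize(vec):
    """Return an L2-normalized copy of `vec` (unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def rank_documents(query_vec, doc_vecs):
    """Rank documents by cosine similarity to the query, best first.

    Returns a list of (document_index, score) pairs.
    """
    q = normalize(query_vec)
    scored = []
    for idx, doc_vec in enumerate(doc_vecs):
        d = normalize(doc_vec)
        scored.append((idx, sum(a * b for a, b in zip(q, d))))
    # Python's sort is stable, so ties keep their original document order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# Toy embeddings invented for illustration (assumption, not model output).
query_embedding = [1.0, 0.0]
document_embeddings = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

for idx, score in rank_documents(query_embedding, document_embeddings):
    print(f"doc {idx}: {score:.3f}")
```

Because documents are embedded without reference to the query, their vectors can be computed once and stored in an index, which is what makes this design practical for large collections.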