ArXiv TLDR

From Tokens to Concepts: Leveraging SAE for SPLADE

arXiv:2604.21511

Yuxuan Zong, Mathias Vast, Basile Van Cooten, Laure Soulier, Benjamin Piwowarski

cs.IR, cs.CL

TLDR

SAE-SPLADE replaces the token vocabulary of sparse IR models with semantic concepts learned via Sparse Auto-Encoders, matching SPLADE's retrieval performance while improving efficiency.

Key contributions

  • Replaces SPLADE's token vocabulary with semantic concepts learned via Sparse Auto-Encoders (SAE); see the sketch after this list.
  • Addresses limitations of token vocabularies such as polysemy and synonymy, and eases multi-lingual and multi-modal use of sparse IR.
  • Achieves retrieval performance comparable to SPLADE on in-domain and out-of-domain tasks.
  • Offers improved efficiency over traditional SPLADE models.
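
The summary does not spell out the architecture, but a rough mental model is a SPLADE-style encoder in which the MLM head over the backbone's token vocabulary is swapped for an SAE encoder over learned concepts, with the usual log-saturation and max pooling applied in concept space. The class name, dimensions, and pooling choices below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SAESpladeEncoder(nn.Module):
    """Hypothetical sketch: map contextual token embeddings into a sparse
    concept space with an SAE encoder, then pool SPLADE-style."""

    def __init__(self, hidden_dim: int = 768, num_concepts: int = 30_000):
        super().__init__()
        # SAE encoder: dense hidden state -> large concept space
        # (replaces SPLADE's projection onto the token vocabulary).
        self.sae_encoder = nn.Linear(hidden_dim, num_concepts)

    def forward(self, token_embeddings: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, hidden_dim) from any backbone LM.
        concept_acts = torch.relu(self.sae_encoder(token_embeddings))  # non-negative concept activations
        concept_acts = torch.log1p(concept_acts)                       # SPLADE-style log saturation
        concept_acts = concept_acts * attention_mask.unsqueeze(-1)     # zero out padding positions
        # Max-pool over tokens -> one sparse concept vector per text.
        return concept_acts.max(dim=1).values                          # (batch, num_concepts)


def score(query_vec: torch.Tensor, doc_vec: torch.Tensor) -> torch.Tensor:
    # Relevance is the sparse dot product, here in concept space rather
    # than over vocabulary terms as in standard SPLADE.
    return (query_vec * doc_vec).sum(dim=-1)
```

As in SPLADE, the sparse concept vectors can be stored in an inverted index, so retrieval cost depends on how many concepts stay active per query and document.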

Why it matters

Sparse IR models are efficient but limited by their reliance on token vocabularies. This work introduces SAE-SPLADE, which replaces tokens with learned semantic concepts, achieving comparable retrieval performance with improved efficiency. This makes sparse IR more robust and easier to extend to multi-lingual and multi-modal settings.

Original Abstract

Learned Sparse IR models, such as SPLADE, offer an excellent efficiency-effectiveness tradeoff. However, they rely on the underlying backbone vocabulary, which might hinder performance (polysemicity and synonymy) and pose a challenge for multi-lingual and multi-modal usages. To solve this limitation, we propose to replace the backbone vocabulary with a latent space of semantic concepts learned using Sparse Auto-Encoders (SAE). Throughout this paper, we study the compatibility of these 2 concepts, explore training approaches, and analyze the differences between our SAE-SPLADE model and traditional SPLADE models. Our experiments demonstrate that SAE-SPLADE achieves retrieval performance comparable to SPLADE on both in-domain and out-of-domain tasks while offering improved efficiency.
