The Pre-Training Study of Expanded-SPLADE Models on Web Document Titles
Hiun Kim, Tae Kwan Lee, Taeryun Won
TLDR
This paper studies pre-training Expanded-SPLADE models for neural IR, finding that pre-training on general corpora with higher learning rates improves retrieval effectiveness.
Key contributions
- Models pre-trained on general corpora with higher learning rates achieve better retrieval effectiveness.
- Under strict pruning, the more effective models incur higher retrieval cost and greater variance in posting-list lengths (see the sketch after this list).
- Repeating general pre-training datasets has little effect on final retrieval effectiveness.
- Identifies a trade-off between retrieval effectiveness and cost in strictly pruned Expanded-SPLADE models.
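The posting-list observation above can be made concrete with a small sketch: once documents are encoded as pruned sparse vectors, each non-zero term places the document on that term's posting list, so the distribution of posting-list lengths is a rough proxy for retrieval cost. The index layout and statistics below are illustrative assumptions, not the paper's evaluation setup.

```python
# Illustrative sketch: build an inverted index from pruned sparse document
# vectors and inspect posting-list lengths as a rough retrieval-cost proxy.
# The index layout and cost measure here are assumptions, not the paper's setup.
from collections import defaultdict
from statistics import mean, pvariance

# Each document is a pruned sparse vector: {term_id: weight}.
docs = {
    "d1": {101: 1.3, 204: 0.8, 509: 0.2},
    "d2": {101: 0.9, 310: 1.1},
    "d3": {101: 0.4, 204: 0.6, 310: 0.3, 777: 1.5},
}

# Inverted index: term_id -> posting list of (doc_id, weight).
index: dict[int, list[tuple[str, float]]] = defaultdict(list)
for doc_id, vec in docs.items():
    for term_id, weight in vec.items():
        index[term_id].append((doc_id, weight))

lengths = [len(postings) for postings in index.values()]
print("posting-list lengths:", lengths)
print("mean:", mean(lengths), "variance:", pvariance(lengths))
# Longer and more uneven posting lists mean more scoring work per query term,
# which is the effectiveness/cost trade-off the paper reports under strict pruning.
```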
Why it matters
This paper addresses transfer-learning issues in pre-training neural IR models such as Expanded-SPLADE. It empirically identifies pre-training choices, in particular the corpus and the learning rate, that improve retrieval effectiveness, helping practitioners build more effective and efficient sparse retrieval systems.
Original Abstract
Masked Language Modeling (MLM) pre-training is one of the primary ways to initialize Neural Information Retrieval (IR) models prior to retrieval fine-tuning. However, studies show that MLM pre-trained models have limited readiness and transfer-learning issues when fine-tuned into neural bi-encoder models. This paper studies the effect of different pre-training datasets and pre-training options on MLM pre-trained models for retrieval fine-tuning. The study focuses on SPLADE-style models, which reuse the MLM layer at fine-tuning time. More specifically, we experimented with Expanded-SPLADE (ESPLADE) models, a specific instance of SPLADE models, using in-house web document titles as the datasets. Pre-training, fine-tuning, and evaluation with optional test-time pruning of sparse vectors are conducted. Our observations are three-fold. First, the fine-tuned models with higher retrieval effectiveness, in both the unpruned and the strictest pruned settings, are mostly those pre-trained on a general corpus with a higher learning rate, and they show lower MLM accuracies. Second, in the strictest pruned setting, those models show higher retrieval cost and higher variance in the lengths of individual posting lists. Third, repeating the general pre-training dataset has little effect on retrieval effectiveness. The experiments empirically identify potential limitations in aligning MLM pre-training with ESPLADE fine-tuning. They also provide the empirical observation that, in the strictest pruned setting, retrieval effectiveness is better maintained at the price of higher retrieval cost, showing a trade-off between the two in our setting.
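For readers unfamiliar with how a SPLADE-style model reuses the MLM layer, the sketch below shows a generic formulation: MLM logits are saturated and max-pooled over token positions to produce a vocabulary-sized sparse term-weight vector, and test-time pruning keeps only the top-k terms. This is a minimal sketch under standard SPLADE assumptions; the paper's ESPLADE expansion and its exact pruning scheme may differ, and `bert-base-uncased` is only a stand-in for an MLM pre-trained encoder.

```python
# Minimal sketch of SPLADE-style sparse encoding with test-time top-k pruning.
# Assumptions: standard SPLADE pooling (log(1 + ReLU(logits)), max over tokens);
# the paper's ESPLADE variant and its pruning procedure may differ.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # stand-in MLM encoder

def splade_encode(text: str) -> torch.Tensor:
    """Return a |vocab|-dimensional sparse term-weight vector for `text`."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits            # (1, seq_len, vocab_size)
    # SPLADE pooling: saturate the logits, then max-pool over token positions,
    # masking out padding so it never contributes.
    weights = torch.log1p(torch.relu(logits))      # (1, seq_len, vocab_size)
    mask = inputs["attention_mask"].unsqueeze(-1)  # (1, seq_len, 1)
    return (weights * mask).max(dim=1).values.squeeze(0)  # (vocab_size,)

def prune_top_k(vec: torch.Tensor, k: int) -> dict[int, float]:
    """Test-time pruning: keep only the k highest-weight vocabulary terms."""
    values, indices = torch.topk(vec, k)
    return {int(i): float(v) for i, v in zip(indices, values) if v > 0}

doc_vec = splade_encode("neural sparse retrieval with web document titles")
print(prune_top_k(doc_vec, k=16))  # stricter pruning -> smaller vectors, cheaper index
```

Stricter pruning (smaller k) shrinks the vectors and the index but discards expansion terms, which is the effectiveness/cost tension the abstract describes.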