Negative Data Mining for Contrastive Learning in Dense Retrieval at IKEA.com
Eva Agapaki, Amritpal Singh Gill
TLDR
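IKEA improves dense retrieval for product search with structured negative sampling and LLM-based relevance evaluation, gaining +2.6% category accuracy offline. An online A/B test showed no statistically significant change in user engagement, which the authors trace to high zero-click rates.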
Key contributions
- Introduces structured negative sampling using product taxonomy and attributes.
- Develops an LLM-as-a-judge evaluation system that scores all candidate products against each query to generate training data.
- Achieves +2.6% average category accuracy offline on real user queries from the Canada market.
- Identifies zero-click behavior as the likely reason the A/B test showed no significant gains: 67% of popular searches have zero-click rates above 50%.
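The summary does not include the paper's sampling procedure, so the following is only a minimal sketch of what taxonomy- and attribute-based negative mining could look like. The `Product` type, the `mine_hard_negatives` function, the hardness ordering, and the choice to skip products in the positive's own leaf category (to reduce false negatives) are all illustrative assumptions, not the authors' implementation.

```python
import random
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Product:
    # Hypothetical catalog record; field names are assumptions.
    product_id: str
    category_path: Tuple[str, ...]  # taxonomy path from root to leaf
    attributes: FrozenSet[str]      # e.g. colour, material


def shared_prefix_len(a: Tuple[str, ...], b: Tuple[str, ...]) -> int:
    """Number of leading taxonomy levels two category paths share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def mine_hard_negatives(
    positive: Product, catalog: List[Product], k: int, rng: random.Random
) -> List[Product]:
    """Return up to k negatives, hardest first.

    Hardness (an assumed heuristic): deeper shared taxonomy prefix first,
    then larger attribute overlap, with a random tie-break.
    """
    candidates = [
        p
        for p in catalog
        if p.product_id != positive.product_id
        # Assumption: same-leaf products may be relevant, so skip them.
        and p.category_path != positive.category_path
    ]

    def hardness(p: Product):
        return (
            shared_prefix_len(p.category_path, positive.category_path),
            len(p.attributes & positive.attributes),
            rng.random(),
        )

    return sorted(candidates, key=hardness, reverse=True)[:k]


# Toy catalog with made-up products.
positive = Product("sofa-3", ("Furniture", "Sofas", "3-seat"), frozenset({"grey", "fabric"}))
catalog = [
    positive,
    Product("sofa-3b", ("Furniture", "Sofas", "3-seat"), frozenset({"blue"})),
    Product("sofa-2", ("Furniture", "Sofas", "2-seat"), frozenset({"grey"})),
    Product("table-1", ("Furniture", "Tables", "Dining"), frozenset({"grey"})),
    Product("lamp-1", ("Lighting", "Lamps", "Floor"), frozenset()),
]
negatives = mine_hard_negatives(positive, catalog, k=2, rng=random.Random(0))
```

In this toy catalog the sibling-category sofa ranks as the hardest negative and the unrelated lamp is never selected, which is the intuition behind using the taxonomy instead of random sampling.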
Why it matters
This paper demonstrates effective hard negative mining for dense retrieval in a real-world e-commerce setting. It critically highlights the disconnect between offline metrics and online user engagement, emphasizing the need to consider actual user search behavior, like zero-click patterns, to bridge this gap.
Original Abstract
Contrastive learning is a core component of modern retrieval systems, but its effectiveness heavily relies on the quality of negative examples used during training. In this work, we present a systematic approach to improving dense retrieval for IKEA product search through structured negative sampling strategies and scalable LLM-as-a-judge relevance evaluation. Building on IKEA Search Engine's late-interaction retrieval architectures, we introduce two key contributions: (1) structured negative sampling strategies that leverage product hierarchical taxonomy and product attributes to generate semantically challenging negatives, and (2) a comprehensive LLM-based evaluation methodology for generating training data. Rather than relying on sparse human annotations or random sampling, our LLM-based evaluation system allocates a score for all candidate products against each query. Our methodology achieves +2.6% average category accuracy on offline real user query experiments on the Canada market. However, our A/B test on long-tail queries showed no statistically significant differences in user engagement metrics between the improved and baseline models (p > 0.05). We trace this gap to user search behavior: 67% of popular searches exhibit zero-click rates above 50%, indicating that a substantial proportion of search sessions result in no product engagement regardless of result ranking. These findings underscore the importance of hard negative mining but also the need for grounding training data and offline evals in real user search behavior -- including query intent distribution and zero-click patterns -- to bridge the gap between offline retrieval quality and online user engagement.
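To make the zero-click statistic concrete, here is a small sketch of how a figure like "67% of popular searches exhibit zero-click rates above 50%" could be computed from search logs. The log format (one `(query, n_clicks)` pair per session) and the function names are assumptions for illustration; the abstract does not define how sessions or popular searches were selected.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def zero_click_rates(sessions: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Per-query fraction of sessions that ended with no product click."""
    totals: Dict[str, int] = defaultdict(int)
    zero: Dict[str, int] = defaultdict(int)
    for query, n_clicks in sessions:
        totals[query] += 1
        if n_clicks == 0:
            zero[query] += 1
    return {q: zero[q] / totals[q] for q in totals}


def share_above(rates: Dict[str, float], threshold: float = 0.5) -> float:
    """Fraction of queries whose zero-click rate is strictly above threshold."""
    if not rates:
        return 0.0
    return sum(r > threshold for r in rates.values()) / len(rates)


# Made-up log: (query, number of product clicks in the session).
sessions = [
    ("sofa", 0), ("sofa", 0), ("sofa", 1),
    ("lamp", 1), ("lamp", 0),
    ("desk", 0),
]
rates = zero_click_rates(sessions)
share = share_above(rates, threshold=0.5)
```

A query with a high zero-click rate contributes little engagement signal no matter how its results are ranked, which is why a ranking improvement on such queries can be invisible in click-based A/B metrics.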