ArXiv TLDR

Efficient Retrieval Scaling with Hierarchical Indexing for Large Scale Recommendation

2604.12965

Dongqi Fu, Kaushik Rangadurai, Haiyu Lu, Yunchen Pu, Siyang Yuan + 11 more

cs.IR

TLDR

This paper proposes a hierarchical indexing method using cross-attention and residual quantization to efficiently scale large-scale retrieval models for recommendations.

Key contributions

  • Proposes a novel hierarchical indexing method for large-scale retrieval models.
  • Uses cross-attention and residual quantization to jointly learn the hierarchical index.
  • Deployed at Meta, supporting daily advertisement recommendations for billions of users.
  • Shows that intermediate nodes in the learned index correspond to a small set of high-quality data, enabling "test-time training" that further improves inference.
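To make the second bullet concrete, here is a minimal sketch of residual quantization, the building block the paper uses to form its hierarchical index. The shapes, seed, and function name are illustrative assumptions, not taken from the paper; the actual system jointly learns the codebooks with cross-attention.

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Greedy residual quantization: each level encodes the residual
    left over by the previous level, producing a coarse-to-fine code
    path (one code per hierarchy level)."""
    residual = x.astype(float).copy()
    codes = []
    for codebook in codebooks:  # one codebook per level of the index
        # pick the codeword nearest to the current residual
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        residual = residual - codebook[idx]
    return codes, residual  # the code path acts as a node path

# toy usage: 2 levels, 4 codewords each, embedding dim 3 (hypothetical sizes)
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(4, 3)) for _ in range(2)]
x = rng.normal(size=3)
codes, resid = residual_quantize(x, codebooks)
```

The key property is that summing the selected codewords plus the final residual reconstructs the input exactly, so each prefix of the code path names a progressively finer region of embedding space.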

Why it matters

This paper addresses the critical challenge of efficiently deploying large-scale foundational retrieval models. By introducing a hierarchical indexing approach, it significantly improves retrieval scalability while preserving exactness. Its real-world deployment at Meta and novel insights into "test-time training" make it a substantial contribution to recommendation systems.

Original Abstract

The increase in data volume, computational resources, and model parameters during training has led to the development of numerous large-scale industrial retrieval models for recommendation tasks. However, effectively and efficiently deploying these large-scale foundational retrieval models remains a critical challenge that has not been fully addressed. Common quick-win solutions for deploying these massive models include relying on offline computations (such as cached user dictionaries) or distilling large models into smaller ones. Yet, both approaches fall short of fully leveraging the representational and inference capabilities of foundational models. In this paper, we explore whether it is possible to learn a hierarchical organization over the memory of foundational retrieval models. Such a hierarchical structure would enable more efficient search by reducing retrieval costs while preserving exactness. To achieve this, we propose jointly learning a hierarchical index using cross-attention and residual quantization for large-scale retrieval models. We also present its real-world deployment at Meta, supporting daily advertisement recommendations for billions of Facebook and Instagram users. Interestingly, we discovered that the intermediate nodes in the learned index correspond to a small set of high-quality data. Fine-tuning the model on this set further improves inference performance, concretizing the concept of "test-time training" within the recommendation system domain. We demonstrate these findings using both internal and public datasets with strong baseline comparisons and hope they contribute to the community's efforts in developing the next generation of foundational retrieval models.
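The abstract's efficiency claim, reducing retrieval cost while preserving quality, comes from searching the hierarchy instead of scoring every item. A minimal sketch of that idea is a two-level beam search: score the internal nodes first, then score only the children of the best few. All shapes and names below are illustrative assumptions; the paper's index is learned jointly with the model rather than fixed as here.

```python
import numpy as np

def beam_retrieve(query, level1, level2, beam=2):
    """Two-level beam search over a hierarchical index.
    level1: (N, d) internal-node embeddings.
    level2: (N, M, d) leaf embeddings grouped under each internal node.
    Only leaves under the `beam` best internal nodes are scored,
    so cost is O(N + beam*M) instead of O(N*M)."""
    top = np.argsort(level1 @ query)[-beam:]          # best internal nodes
    cand = level2[top].reshape(-1, level2.shape[-1])  # their leaves only
    scores = cand @ query
    best = int(np.argmax(scores))
    node = int(top[best // level2.shape[1]])          # which kept node won
    leaf = int(best % level2.shape[1])                # which of its leaves
    return node, leaf

# toy usage: 4 internal nodes, 5 leaves each, dim 3 (hypothetical sizes)
rng = np.random.default_rng(1)
level1 = rng.normal(size=(4, 3))
level2 = rng.normal(size=(4, 5, 3))
q = rng.normal(size=3)
node, leaf = beam_retrieve(q, level1, level2)
```

With a well-learned index the pruned nodes rarely contain the true best leaf, which is how a hierarchy can cut cost while keeping retrieval quality close to exhaustive search.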
