HaS: Accelerating RAG through Homology-Aware Speculative Retrieval
Peng Peng, Weiwei Lin, Wentai Wu, Xinyang Wang, Yongheng Liu
TLDR
HaS accelerates RAG by using homology-aware speculative retrieval to bypass slow full-database lookups, significantly reducing latency.
Key contributions
- Introduces HaS, a homology-aware speculative retrieval framework for RAG acceleration.
- Performs low-latency speculative retrieval over restricted scopes, then validates candidates.
- Validates by re-identifying homologous queries, bypassing slow full-database retrieval.
- Reduces retrieval latency by 23.74% and 36.99% across two datasets with only a 1-2% accuracy drop, and also accelerates multi-hop queries in agentic RAG pipelines.
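The speculate-then-validate loop above can be sketched as a small cache in front of the full retriever. Everything here is an illustrative assumption, not the paper's implementation: the toy character-frequency embedding, the `HaSCache` class, the 0.9 homology threshold, and the use of past queries' results as the "restricted scope" are all stand-ins for HaS's actual scope construction and re-identification model.

```python
# Hypothetical sketch of a HaS-style speculative retrieval loop.
# embed(), HaSCache, and the 0.9 threshold are illustrative stand-ins,
# not the paper's actual components.
from dataclasses import dataclass, field

def embed(query: str) -> list[float]:
    # Toy embedding: normalized character-frequency vector
    # (a stand-in for a real sentence encoder).
    vec = [0.0] * 26
    for ch in query.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

@dataclass
class HaSCache:
    threshold: float = 0.9  # homology re-identification cutoff (assumed value)
    history: list = field(default_factory=list)  # (embedding, docs) of past queries

    def retrieve(self, query, full_db_search):
        q = embed(query)
        # Speculative step: draft candidate documents from a restricted
        # scope (here: results previously retrieved for past queries).
        best_docs, best_sim = None, 0.0
        for emb, docs in self.history:
            sim = cosine(q, emb)
            if sim > best_sim:
                best_docs, best_sim = docs, sim
        # Validation: accept the draft if the incoming query is
        # re-identified as a homologous re-encounter of a past query,
        # bypassing the slow full-database retrieval.
        if best_docs is not None and best_sim >= self.threshold:
            return best_docs, "speculative-hit"
        # Fallback: full-database retrieval, then record the result
        # so future homologous queries can reuse it.
        docs = full_db_search(query)
        self.history.append((q, docs))
        return docs, "full-retrieval"
```

A repeated query phrased slightly differently (e.g. "what is RAG?" followed by "What is RAG") would score near 1.0 under this toy embedding and take the fast speculative path; the real system's gains come from such homologous re-encounters being common under real-world query popularity patterns.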
Why it matters
RAG's effectiveness is often limited by slow retrieval from large databases. HaS addresses this by significantly speeding up the process, making LLMs more efficient and scalable. This is crucial for real-world applications where query latency is critical, especially in complex agentic RAG pipelines.
Original Abstract
Retrieval-Augmented Generation (RAG) expands the knowledge boundary of large language models (LLMs) at inference by retrieving external documents as context. However, retrieval becomes increasingly time-consuming as the knowledge databases grow in size. Existing acceleration strategies either compromise accuracy through approximate retrieval, or achieve marginal gains by reusing results of strictly identical queries. We propose HaS, a homology-aware speculative retrieval framework that performs low-latency speculative retrieval over restricted scopes to obtain candidate documents, followed by validating whether they contain the required knowledge. The validation, grounded in the homology relation between queries, is formulated as a homologous query re-identification task: once a previously observed query is identified as a homologous re-encounter of the incoming query, the draft is deemed acceptable, allowing the system to bypass slow full-database retrieval. Benefiting from the prevalence of homologous queries under real-world popularity patterns, HaS achieves substantial efficiency gains. Extensive experiments demonstrate that HaS reduces retrieval latency by 23.74% and 36.99% across datasets with only a 1-2% marginal accuracy drop. As a plug-and-play solution, HaS also significantly accelerates complex multi-hop queries in modern agentic RAG pipelines. Source code is available at: https://github.com/ErrEqualsNil/HaS.