Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies
Lorenz Brehme, Thomas Ströhle, Ruth Breu
TLDR
This paper introduces CARE, a context-aware LLM-as-judge strategy that outperforms existing methods for evaluating multi-hop reasoning in RAG systems.
Key contributions
- Introduces Context-Aware Retriever Evaluation (CARE) for multi-hop RAG systems.
- CARE consistently outperforms other LLM-as-judge methods for multi-hop reasoning.
- Performance gains are most pronounced for larger LLMs with longer context windows; single-hop queries show minimal sensitivity.
- Open-sourced experimental data for reproducibility.
Why it matters
Evaluating multi-hop reasoning in RAG systems is challenging: most existing methods judge each retrieved context in isolation, even though multi-hop evidence may only be relevant in combination. This paper provides a robust, context-aware evaluation strategy (CARE) that improves the reliability and accuracy of retriever evaluation for complex queries, which is crucial for advancing RAG's real-world applicability.
Original Abstract
Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions more accurately. However, research on evaluating RAG systems, particularly the retriever component, remains limited, as most existing work focuses on single-context retrieval rather than multi-hop queries, where individual contexts may appear irrelevant in isolation but are essential when combined. In this research, we use the HotPotQA, MuSiQue, and SQuAD datasets to simulate a RAG system and compare three LLM-as-judge evaluation strategies, including our proposed Context-Aware Retriever Evaluation (CARE). Our goal is to better understand how multi-hop reasoning can be most effectively evaluated in RAG systems. Experiments with LLMs from OpenAI, Meta, and Google demonstrate that CARE consistently outperforms existing methods for evaluating multi-hop reasoning in RAG systems. The performance gains are most pronounced in models with larger parameter counts and longer context windows, while single-hop queries show minimal sensitivity to context-aware evaluation. Overall, the results highlight the critical role of context-aware evaluation in improving the reliability and accuracy of retrieval-augmented generation systems, particularly in complex query scenarios. To ensure reproducibility, we provide the complete data of our experiments at https://github.com/lorenzbrehme/CARE.