ArXiv TLDR

Learning to Route Queries to Heads for Attention-based Re-ranking with Large Language Models

arXiv:2604.24608

Yuxing Tian, Fengran Mo, Zhiqi Huang, Weixu Zhang, Jian-Yun Nie

cs.IR · cs.AI · cs.CL

TLDR

RouteHead learns to dynamically select optimal attention heads for LLM re-ranking, improving performance over static aggregation methods.

Key contributions

  • Introduces RouteHead, a novel query-dependent method for selecting optimal attention heads in LLM re-ranking.
  • Learns a lightweight router that dynamically maps each query to a small set of informative attention heads (see the sketch after this list).
  • Generates pseudo labels for query-to-head optimality via an offline search, then trains the router on them.
  • Achieves consistent performance improvements across diverse benchmarks and various LLM backbones.
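
As a rough illustration of the routing idea, here is a minimal PyTorch sketch, assuming a learnable embedding per head, a sigmoid gate scored against the query embedding, and a binary-cross-entropy loss with an L1 sparsity term. The names (`HeadRouter`, `router_loss`) and the exact loss form are assumptions for illustration, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadRouter(nn.Module):
    """Lightweight router: maps a query embedding (taken from the frozen
    LLM's hidden states) to a gate in (0, 1) for each attention head."""

    def __init__(self, num_heads: int, d_model: int):
        super().__init__()
        # One learnable embedding per attention head.
        self.head_emb = nn.Parameter(torch.randn(num_heads, d_model) * 0.02)

    def forward(self, query_emb: torch.Tensor) -> torch.Tensor:
        # query_emb: (batch, d_model). Score each head by similarity
        # to the query, then squash to a per-head gate.
        logits = query_emb @ self.head_emb.T   # (batch, num_heads)
        return torch.sigmoid(logits)

def router_loss(gates, pseudo_labels, l1_weight=0.01):
    # Binary targets come from the offline search (1 = head in the
    # optimal set); the L1 term keeps the selected head set sparse
    # (gates are non-negative, so their mean is the L1 penalty).
    bce = F.binary_cross_entropy(gates, pseudo_labels)
    return bce + l1_weight * gates.mean()

if __name__ == "__main__":
    router = HeadRouter(num_heads=32, d_model=64)
    q = torch.randn(4, 64)                         # stand-in query embeddings
    labels = torch.randint(0, 2, (4, 32)).float()  # stand-in pseudo labels
    print(router_loss(router(q), labels))
```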

Why it matters

Attention-based LLM re-rankers typically aggregate signals from all heads or from a fixed, heuristically chosen subset, even though the informative heads vary across queries and domains. RouteHead makes head selection a learned, query-dependent step, yielding consistent gains in relevance scoring across benchmarks and backbones without fine-tuning the LLM itself.

Original Abstract

Large Language Models (LLMs) have recently been explored as fine-grained zero-shot re-rankers by leveraging attention signals to estimate document relevance. However, existing methods either aggregate attention signals across all heads or rely on a statically selected subset identified by heuristic rules. This solution can be suboptimal because the informative heads can vary across queries or domains. Moreover, naively combining multiple heads can degrade performance due to redundancy or conflicting ranking signals. In this paper, we propose a query-dependent head selection method, RouteHead, for attention-based re-ranking with LLMs. Specifically, we learn a lightweight router that can map each query to an optimal head set, and relevance scores are computed by aggregating attention signals only from these heads. Since query-to-head optimal labels are unavailable, we first construct pseudo labels via an offline search. The router represents each head with a learnable embedding and represents each query using an embedding extracted from the hidden states of the frozen LLM. Then it is trained on the pseudo labels with a sparsity regularizer. Experiments on diverse benchmarks and multiple LLM backbones show that the proposed method consistently outperforms strong baselines.
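
To make the scoring step concrete, the sketch below shows one way the aggregation over routed heads could look, continuing the Python sketch above. The hard top-k selection and mean pooling over document tokens are assumptions; the abstract only says scores are computed by aggregating attention signals from the selected heads.

```python
import torch

@torch.no_grad()
def relevance_score(attn: torch.Tensor, head_gates: torch.Tensor, k: int = 8) -> float:
    """Aggregate attention mass on the document tokens from the top-k
    routed heads into a single relevance score.

    attn:       (num_heads, num_query_tokens, num_doc_tokens)
                attention weights from the frozen LLM.
    head_gates: (num_heads,) router output for this query.
    """
    topk = torch.topk(head_gates, k).indices    # ids of the selected heads
    per_head = attn[topk].mean(dim=(1, 2))      # mean attention per head, (k,)
    return per_head.mean().item()               # scalar relevance score

if __name__ == "__main__":
    attn = torch.rand(32, 5, 100)   # toy: 32 heads, 5 query tokens, 100 doc tokens
    gates = torch.rand(32)          # toy router gates
    print(relevance_score(attn, gates, k=8))
```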

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.