Prism-Reranker: Beyond Relevance Scoring -- Jointly Producing Contributions and Evidence for Agentic Retrieval
TLDR
Prism-Reranker is a family of rerankers that emits not only relevance scores but also contribution statements and evidence passages, tailoring retrieval output for agents and RAG.
Key contributions
- Emits contribution statements and evidence passages, going beyond scalar relevance scores for documents.
- Built on Qwen3.5, available in four sizes (0.8B, 2B, 4B, 9B) for various deployment scales.
- Trained with a hybrid objective combining point-wise distillation and supervised fine-tuning.
- Achieves strong results on a QA subset of BEIR; the same recipe also upgrades existing rerankers, e.g., raising Qwen3-Reranker-4B's average BEIR-QA NDCG@10 by +1.54.
Why it matters
Scalar-only rerankers force agents to dump entire documents into the language-model context, wasting tokens on tangential passages and boilerplate. Prism-Reranker instead supplies concise contribution statements and evidence passages, so RAG pipelines and autonomous agents keep only query-relevant text in context, making better use of a limited LLM context budget.
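The token-saving argument above can be sketched in a few lines. This is an illustrative example, not the paper's code: the result dictionary fields (`verdict`, `document`, `contribution`, `evidence`) are hypothetical names for the outputs the abstract describes.

```python
# Illustrative sketch: how an agent might assemble its LLM context from
# reranker output. With a scalar-only reranker it must include whole
# documents; with Prism-style output it can keep only evidence passages.
# All field names below are assumptions for illustration.

def build_context(results, use_evidence=True):
    """Concatenate either evidence passages or full documents for 'yes' hits."""
    parts = []
    for r in results:
        if r["verdict"] != "yes":
            continue  # drop documents the reranker judged irrelevant
        if use_evidence and "evidence" in r:
            # Prism-style: self-contained rewrite preserving query-relevant signal
            parts.append(r["evidence"])
        else:
            # scalar-only fallback: the agent must ingest the whole document
            parts.append(r["document"])
    return "\n\n".join(parts)

results = [
    {
        "verdict": "yes",
        "document": ("Nav bar | Home | About ... The capital of France is "
                     "Paris. ... footer links, cookie banner, boilerplate ..."),
        "contribution": "States the capital of France.",
        "evidence": "The capital of France is Paris.",
    },
    {"verdict": "no", "document": "Unrelated product page."},
]

full = build_context(results, use_evidence=False)
lean = build_context(results, use_evidence=True)
```

Here `lean` contains only the one evidence sentence, while `full` carries the entire noisy document, which is the gap the paper targets.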
Original Abstract
Modern retrieval pipelines increasingly serve downstream consumers like retrieval-augmented generation (RAG) and autonomous agents that need more than a scalar relevance score. A reranker that only tells the caller "how relevant" forces the agent to dump entire documents into the language-model context, wasting tokens on tangential passages and boilerplate. We introduce Prism-Reranker, a family of reranker models built on Qwen3.5 at four sizes (0.8B, 2B, 4B, 9B) that goes beyond scalar scoring. In addition to the standard yes/no relevance judgement, whenever the verdict is yes the model emits (i) a contribution statement summarizing how the document helps the query, and (ii) an evidence passage: a self-contained rewrite that preserves every query-relevant signal while discarding noise. Prism-Reranker is trained with a hybrid objective combining point-wise distillation from a strong commercial reranker API with supervised fine-tuning on contribution and evidence targets. We curate training data from KaLM-Embedding's open-source aggregation, augmented with real web documents retrieved via commercial search APIs for open-domain queries and LLM-synthesized variants, and rewrite a portion of queries into keyword-style reformulations to adapt the model to agent-issued traffic. To reconcile inconsistent labels across open corpora and obtain crisp binary supervision, we relabel data with an LLM-as-Judge ensemble aggregating votes from five frontier LLMs. On a QA subset of BEIR and on an LLM-judged evaluation of contribution and evidence quality, Prism-Reranker attains solid results across all four sizes. We further show that the same recipe extends existing LLM-based rerankers, augmenting Qwen3-Reranker-4B with contribution and evidence capabilities while improving its average BEIR-QA NDCG@10 by +1.54 over the base model. Model weights, training recipe, and evaluation suite are released.
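The abstract mentions relabeling with an LLM-as-Judge ensemble that aggregates votes from five frontier LLMs. A minimal sketch of one plausible aggregation rule, majority voting, follows; the paper does not specify the exact rule, so treat this as an assumption, with string votes standing in for real judge outputs.

```python
# Illustrative sketch (assumption): aggregate five yes/no judge votes into
# one crisp binary label via strict majority, as one way to reconcile
# inconsistent labels across open corpora.

def ensemble_label(votes):
    """Return 'yes' if a strict majority of judges vote yes, else 'no'."""
    yes_count = sum(1 for v in votes if v == "yes")
    return "yes" if yes_count > len(votes) / 2 else "no"

# Five stand-in judge votes for one (query, document) pair.
label = ensemble_label(["yes", "yes", "no", "yes", "no"])
```

With five voters a strict majority needs at least three agreeing judges, which filters out pairs where the open-corpus label disagrees with most judges.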