SURE-RAG: Sufficiency and Uncertainty-Aware Evidence Verification for Selective Retrieval-Augmented Generation

May 5, 2026 · arXiv:2605.03534

cs.CL, cs.IR, cs.LG

TLDR

SURE-RAG verifies whether retrieved evidence is sufficient to justify a candidate answer and abstains when it is not, using a transparent aggregation protocol that cuts unsafe answers by 37% at 30% coverage on HotpotQA-RAG v3.

Key contributions

Introduces SURE-RAG, a transparent protocol for evidence sufficiency verification in selective RAG.
Aggregates local claim-evidence relations into interpretable answer-level signals like coverage and conflict.
Achieves 0.9075 Macro-F1 on HotpotQA-RAG v3, above DeBERTa mean-pooling (0.6516) and a GPT-4o judge (0.7284), and on par with an opaque concat cross-encoder (0.8888 +/- 0.0109).
Reduces unsafe answers by 37% (from 0.2588 to 0.1642) at 30% coverage.
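
The 37% figure is a relative reduction in selective risk, that is, the share of unsafe answers among the questions the system chooses to answer. A minimal sketch of that metric follows, assuming each example has a confidence score and a binary unsafe label; the function names and the idea of ranking by a single score are illustrative assumptions, not the paper's code.

```python
def risk_at_coverage(scores, unsafe, coverage):
    """Share of unsafe answers among the top-`coverage` fraction of examples by score.

    `scores` and `unsafe` are hypothetical inputs: one confidence score and one
    0/1 unsafe label per example.
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    if len(scores) != len(unsafe) or not scores:
        raise ValueError("scores and unsafe must be non-empty and equally long")
    k = max(1, round(coverage * len(scores)))  # number of questions answered
    ranked = sorted(zip(scores, unsafe), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def relative_reduction(baseline, improved):
    """Relative drop from a baseline risk to an improved risk."""
    return (baseline - improved) / baseline


# Risk values reported at 30% coverage: 0.2588 -> 0.1642.
print(f"{relative_reduction(0.2588, 0.1642):.1%}")  # prints 36.6%, reported as 37%
```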

Why it matters

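RAG systems often retrieve evidence that is on-topic but does not justify the answer, which leads to unreliable output. SURE-RAG offers an auditable way to check evidence sufficiency and to abstain when support is not established, which lowered the rate of unsafe answers on the controlled HotpotQA-RAG v3 benchmark. The HaluBench comparison, where GPT-4o outperforms SURE-RAG, indicates that this gain is specific to controlled sufficiency verification and does not carry over to natural hallucination detection.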

Original Abstract

Retrieval-augmented generation (RAG) grounds answers in retrieved passages, but retrieval is not verification: a passage can be topical and still fail to justify the answer. We frame this gap as evidence sufficiency verification for selective RAG answering: given a question, a candidate answer, and retrieved evidence, predict whether the evidence supports, refutes, or is insufficient, and abstain unless support is established. We present SURE-RAG, a transparent aggregation protocol built on the observation that evidence sufficiency is a set-level property: missing hops and unresolved conflicts cannot be detected by independent passage scoring. A shared pair-level claim-evidence verifier produces local relation distributions, which SURE-RAG aggregates into interpretable answer-level signals -- coverage, relation strength, disagreement, conflict, and retrieval uncertainty -- yielding a three-way decision and an auditable selective score. We evaluate on HotpotQA-RAG v3, a controlled multi-hop benchmark, under an artifact-aware protocol (shortcut baselines, counterfactual swaps, no-oracle checks, GPT-4o audits). Calibrated SURE-RAG reaches 0.9075 Macro-F1 (0.8951 +/- 0.0069), substantially above DeBERTa mean-pooling (0.6516) and a GPT-4o judge (0.7284), while matching a strong but opaque concat cross-encoder (0.8888 +/- 0.0109) with full auditability. Risk at 30% coverage drops from 0.2588 to 0.1642, a 37% reduction in unsafe answers. To deliberately probe the task boundary, we further contrast SURE-RAG with GPT-4o on HaluBench unsafe detection: the ranking reverses (0.3343 vs 0.7389 unsafe-F1), establishing that controlled sufficiency verification and natural hallucination detection are distinct problems.
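
The abstract names the answer-level signals but not their formulas. The sketch below is one plausible way to aggregate pair-level relation distributions into set-level signals and a three-way decision. Every definition in it (the `PairScore` fields, the threshold `tau`, how each signal is computed, and the selective score) is an assumption made for illustration and should not be read as the paper's protocol.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class PairScore:
    """Hypothetical verifier output for one (claim, passage) pair."""
    support: float
    refute: float
    neutral: float
    retrieval_score: float  # assumed retriever confidence in [0, 1]


def aggregate(claims, tau=0.5):
    """Aggregate per-claim pair scores into (label, selective_score, signals).

    `claims` is a list with one inner list of PairScore per claim (hop).
    All signal definitions here are illustrative assumptions.
    """
    if not claims:
        raise ValueError("need at least one claim")
    best_support = [max((p.support for p in pairs), default=0.0) for pairs in claims]
    best_refute = [max((p.refute for p in pairs), default=0.0) for pairs in claims]
    all_pairs = [p for pairs in claims for p in pairs]

    signals = {
        # share of claims backed by at least one passage
        "coverage": sum(s >= tau for s in best_support) / len(claims),
        # average strength of the best supporting passage per claim
        "strength": mean(best_support),
        # spread of support scores across passages for the same claim
        "disagreement": mean(
            [pstdev([p.support for p in pairs]) if pairs else 0.0 for pairs in claims]
        ),
        # strongest refutation found for any claim
        "conflict": max(best_refute),
        "retrieval_uncertainty": (
            1.0 - mean([p.retrieval_score for p in all_pairs]) if all_pairs else 1.0
        ),
    }

    # Set-level decision: one refuted or uncovered claim changes the whole answer.
    refuted = any(r >= tau and r > s for s, r in zip(best_support, best_refute))
    if refuted:
        label = "refutes"
    elif signals["coverage"] == 1.0:
        label = "supports"
    else:
        label = "insufficient"

    score = signals["strength"] * signals["coverage"] * (1.0 - signals["conflict"])
    return label, score, signals


# Two-hop question: hop 1 is supported, hop 2 only has a topical passage.
hop1 = [PairScore(support=0.9, refute=0.05, neutral=0.05, retrieval_score=0.8)]
hop2 = [PairScore(support=0.1, refute=0.1, neutral=0.8, retrieval_score=0.7)]
label, score, signals = aggregate([hop1, hop2])
print(label, signals["coverage"])  # the missing hop yields "insufficient"
```

Scoring each passage independently would rate the first passage highly and miss that the second hop is unsupported, which is the set-level point the abstract makes. A selective system would answer only when the label is "supports" and the score clears a chosen threshold.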


