Beyond Precision: Importance-Aware Recall for Factuality Evaluation in Long-Form LLM Generation
Nazanin Jafari, James Allan, Mohit Iyyer
TLDR
This paper introduces a new framework to evaluate long-form LLM factuality by jointly measuring precision and importance-aware recall.
Key contributions
- Introduces a comprehensive framework for LLM factuality, measuring both precision and recall.
- Leverages external knowledge sources to construct reference facts for evaluation.
- Incorporates an importance-aware weighting scheme based on relevance and salience.
- Shows current LLMs excel in precision but struggle with factual recall in long-form generation.
Why it matters
Existing LLM factuality evaluations focus almost exclusively on precision and overlook recall, giving an incomplete picture of factual quality. This paper addresses that gap, showing that factual incompleteness, not inaccuracy, is the dominant limitation of current LLMs in long-form generation, and it provides a more robust method for assessing long-form output quality.
Original Abstract
Evaluating the factuality of long-form output generated by large language models (LLMs) remains challenging, particularly when responses are open-ended and contain many fine-grained factual statements. Existing evaluation methods primarily focus on precision: they decompose a response into atomic claims and verify each claim against external knowledge sources such as Wikipedia. However, this overlooks an equally important dimension of factuality: recall, whether the generated response covers the relevant facts that should be included. We propose a comprehensive factuality evaluation framework that jointly measures precision and recall. Our method leverages external knowledge sources to construct reference facts and determine whether they are captured in generated text. We further introduce an importance-aware weighting scheme based on relevance and salience. Our analysis reveals that current LLMs perform substantially better on precision than on recall, suggesting that factual incompleteness remains a major limitation of long-form generation and that models are generally better at covering highly important facts than the full set of relevant facts.
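The abstract describes two complementary measurements: precision over atomic claims extracted from the response, and recall over reference facts weighted by importance. A minimal sketch of how such metrics could be computed, assuming claims and reference facts have already been extracted, verified, and assigned relevance/salience weights (the paper's actual decomposition, verification, and weighting procedures may differ):

```python
def precision(generated_claims, is_supported):
    """Fraction of generated atomic claims supported by the knowledge source."""
    if not generated_claims:
        return 0.0
    return sum(is_supported[c] for c in generated_claims) / len(generated_claims)

def weighted_recall(reference_facts, weights, is_covered):
    """Importance-weighted fraction of reference facts covered by the response."""
    total_weight = sum(weights[f] for f in reference_facts)
    if total_weight == 0:
        return 0.0
    covered_weight = sum(weights[f] for f in reference_facts if is_covered[f])
    return covered_weight / total_weight

# Toy example with hypothetical claims, facts, and importance weights.
gen = ["c1", "c2", "c3"]
supported = {"c1": True, "c2": True, "c3": False}

refs = ["f1", "f2", "f3"]
w = {"f1": 1.0, "f2": 0.5, "f3": 0.25}   # assumed relevance/salience weights
covered = {"f1": True, "f2": False, "f3": True}

p = precision(gen, supported)          # high: 2 of 3 claims are supported
r = weighted_recall(refs, w, covered)  # penalized by the missed fact f2
```

Note how a response can score well on precision (every claim it makes is true) yet poorly on unweighted recall, while the weighting softens the penalty when the omitted facts are low-importance — the asymmetry the paper's analysis highlights.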