LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation
Huyen Nguyen, Haoxuan Zhang, Yang Zhang, Junhua Ding, Haihua Chen
TLDR
LLM-ReSum is a self-reflective framework that uses LLM-based evaluation to improve summary quality without model finetuning.
Key contributions
- Traditional lexical-overlap metrics (ROUGE, BLEU) correlate weakly, or even negatively, with human judgments of summary quality (see the correlation sketch after this list).
- Task-specific neural metrics and LLM-based evaluators align far more closely with human judgments, especially judgments of linguistic quality.
- The LLM-ReSum framework couples LLM-based generation and self-evaluation in a closed feedback loop to improve summary quality without finetuning.
- In human evaluations, LLM-ReSum improves low-quality summaries by up to 33% in factual accuracy and up to 39% in coverage.
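The meta-evaluation behind the first two findings reduces to one operation: score the same summaries with an automatic metric and with human annotators, then measure the rank correlation between the two sets of scores. Below is a minimal sketch of that computation using Spearman correlation; the metric names, scores, and ratings are illustrative placeholders, not the paper's data.

```python
# Minimal meta-evaluation sketch: how well does each automatic metric's
# ranking of summaries agree with human ratings? (Illustrative data only.)
from scipy.stats import spearmanr

# Hypothetical per-summary scores from two automatic evaluators
metric_scores = {
    "rouge_l": [0.42, 0.35, 0.51, 0.28, 0.47],
    "llm_judge": [3.0, 2.0, 4.5, 1.5, 4.0],
}
# Hypothetical human quality ratings for the same five summaries (1-5 scale)
human_ratings = [3, 2, 5, 1, 4]

for name, scores in metric_scores.items():
    rho, p_value = spearmanr(scores, human_ratings)
    print(f"{name}: Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

A metric that tracks human preferences will have rho near 1; the paper's finding is that lexical-overlap metrics land near zero or below on this kind of test, while neural and LLM-based evaluators score much higher.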
Why it matters
This paper tackles the challenge of reliably evaluating LLM-generated summaries and proposes a novel self-reflective framework. It shows how LLMs can improve their own summaries, yielding more accurate and coherent outputs across diverse domains without finetuning, a step toward LLM systems that monitor and correct their own output.
Original Abstract
Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains, covering documents from short news articles to long scientific, governmental, and legal texts (2K-27K words) with over 1,500 human-annotated summaries. Our results show that traditional lexical overlap metrics (e.g., ROUGE, BLEU) exhibit weak or negative correlation with human judgments, while task-specific neural metrics and LLM-based evaluators achieve substantially higher alignment, especially for linguistic quality assessment. Leveraging these findings, we propose LLM-ReSum, a self-reflective summarization framework that integrates LLM-based evaluation and generation in a closed feedback loop without model finetuning. Across three domains, LLM-ReSum improves low-quality summaries by up to 33% in factual accuracy and 39% in coverage, with human evaluators preferring refined summaries in 89% of cases. We additionally introduce PatentSumEval, a new human-annotated benchmark for legal document summarization comprising 180 expert-evaluated summaries. All code and datasets will be released on GitHub.
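To make the closed feedback loop concrete, here is a minimal sketch of a summarize, evaluate, refine cycle in the spirit of LLM-ReSum. The prompts, the 1-5 rubric, the score parser, and the acceptance threshold are all illustrative assumptions, not the paper's exact design; `call_llm` stands in for any chat-completion API.

```python
import re
from typing import Callable

def parse_min_score(critique: str) -> float:
    """Pull numeric scores (assumed 1-5 scale) out of the critique text and
    return the lowest one; 0.0 if none is found. Purely illustrative."""
    scores = [float(s) for s in re.findall(r"\b[1-5](?:\.\d)?\b", critique)]
    return min(scores) if scores else 0.0

def summarize_with_reflection(document: str,
                              call_llm: Callable[[str], str],
                              max_rounds: int = 3,
                              accept_score: float = 4.0) -> str:
    """Generate a summary, self-evaluate it, and refine until the evaluator
    is satisfied or the round budget is spent. No model weights change."""
    summary = call_llm(f"Summarize the following document:\n\n{document}")
    for _ in range(max_rounds):
        # Self-evaluation: the same LLM scores factual accuracy and coverage
        # (assumed 1-5 rubric) and lists concrete problems it finds.
        critique = call_llm(
            "Rate this summary for factual accuracy and coverage on a 1-5 "
            "scale, then list any specific issues.\n\n"
            f"Document:\n{document}\n\nSummary:\n{summary}"
        )
        if parse_min_score(critique) >= accept_score:
            break  # evaluator is satisfied; stop refining
        # Refinement: feed the critique back into generation.
        summary = call_llm(
            "Revise the summary to address the critique below.\n\n"
            f"Document:\n{document}\n\nSummary:\n{summary}\n\n"
            f"Critique:\n{critique}"
        )
    return summary
```

In use, `call_llm` would wrap whatever provider is available (an API client or a local model); since only prompts change between rounds, the loop needs no finetuning, which is the framework's central design choice.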