NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment
Wenqing Wu, Yi Zhao, Yuzhuo Wang, Siyou Li, Juexi Shao + 2 more
TLDR
NovBench is introduced as the first benchmark for evaluating large language models' ability to assess research paper novelty, revealing that current models have only a limited understanding of scientific novelty.
Key contributions
- Introduces NovBench, the first large-scale benchmark for evaluating LLMs on academic paper novelty assessment.
- Comprises 1,684 paper-review pairs from a leading NLP conference, pairing novelty descriptions extracted from paper introductions with expert-written novelty evaluations.
- Proposes a four-dimensional framework (Relevance, Correctness, Coverage, Clarity) for assessing LLM novelty evaluations.
- Reveals that current LLMs have a limited understanding of scientific novelty and that fine-tuned models struggle with instruction adherence.
Why it matters
Assessing research novelty is crucial for peer review, but human reviewers are overwhelmed. This paper provides a critical tool, NovBench, to systematically evaluate LLMs for this task. Its findings highlight significant gaps in current LLM capabilities, guiding future research towards more effective AI-assisted peer review systems.
Original Abstract
Novelty is a core requirement in academic publishing and a central focus of peer review, yet the growing volume of submissions has placed increasing pressure on human reviewers. While large language models (LLMs), including those fine-tuned on peer review data, have shown promise in generating review comments, the absence of a dedicated benchmark has limited systematic evaluation of their ability to assess research novelty. To address this gap, we introduce NovBench, the first large-scale benchmark designed to evaluate LLMs' capability to generate novelty evaluations in support of human peer review. NovBench comprises 1,684 paper-review pairs from a leading NLP conference, including novelty descriptions extracted from paper introductions and corresponding expert-written novelty evaluations. We focus on both sources because the introduction provides a standardized and explicit articulation of novelty claims, while expert-written novelty evaluations constitute one of the current gold standards of human judgment. Furthermore, we propose a four-dimensional evaluation framework (including Relevance, Correctness, Coverage, and Clarity) to assess the quality of LLM-generated novelty evaluations. Extensive experiments on both general and specialized LLMs under different prompting strategies reveal that current models exhibit limited understanding of scientific novelty, and that fine-tuned models often suffer from instruction-following deficiencies. These findings underscore the need for targeted fine-tuning strategies that jointly improve novelty comprehension and instruction adherence.
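To make the evaluation setup concrete, below is a minimal sketch of how a paper-review pair and the four-dimensional scoring (Relevance, Correctness, Coverage, Clarity) could be wired together with an LLM judge. The field names, the 1-5 scale, the prompt wording, and the `judge` callable are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of NovBench-style scoring along the four dimensions
# named in the paper: Relevance, Correctness, Coverage, Clarity.
# Field names, the 1-5 scale, and the judge prompt are assumptions for
# illustration only.
from dataclasses import dataclass
from typing import Callable, Dict

DIMENSIONS = ("Relevance", "Correctness", "Coverage", "Clarity")

@dataclass
class PaperReviewPair:
    paper_id: str
    novelty_description: str   # novelty claims extracted from the introduction
    expert_evaluation: str     # expert-written novelty evaluation (gold standard)

def build_judge_prompt(pair: PaperReviewPair, generated_evaluation: str) -> str:
    """Assemble a rubric prompt asking a judge model to rate a generated
    novelty evaluation on each dimension from 1 (poor) to 5 (excellent)."""
    return (
        "You are assessing an LLM-generated novelty evaluation of a paper.\n\n"
        f"Novelty claims (from the introduction):\n{pair.novelty_description}\n\n"
        f"Expert-written novelty evaluation (reference):\n{pair.expert_evaluation}\n\n"
        f"Generated novelty evaluation:\n{generated_evaluation}\n\n"
        "Rate the generated evaluation on Relevance, Correctness, Coverage, "
        "and Clarity, each on a 1-5 scale, as 'Dimension: score' lines."
    )

def score_evaluation(
    pair: PaperReviewPair,
    generated_evaluation: str,
    judge: Callable[[str], str],  # e.g. a thin wrapper around any chat-completion API
) -> Dict[str, int]:
    """Call the judge model once and parse one integer score per dimension."""
    reply = judge(build_judge_prompt(pair, generated_evaluation))
    scores: Dict[str, int] = {}
    for line in reply.splitlines():
        for dim in DIMENSIONS:
            if line.strip().lower().startswith(dim.lower()):
                digits = [c for c in line if c.isdigit()]
                if digits:
                    scores[dim] = int(digits[0])
    return scores
```

The `judge` callable could wrap any chat-completion endpoint; averaging the per-dimension scores across all 1,684 pairs would then give a benchmark-level comparison between the models being evaluated.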