ArXiv TLDR

NovBench: Evaluating Large Language Models on Academic Paper Novelty Assessment

2604.11543

Wenqing Wu, Yi Zhao, Yuzhuo Wang, Siyou Li, Juexi Shao + 2 more

cs.CL cs.AI cs.DL cs.IR

TLDR

NovBench is introduced as the first benchmark to evaluate large language models' ability to assess research paper novelty, revealing current LLMs' limitations.

Key contributions

  • Introduces NovBench, the first large-scale benchmark for evaluating LLMs on academic paper novelty assessment.
  • Comprises 1,684 paper-review pairs from a leading NLP conference, pairing novelty descriptions extracted from paper introductions with expert-written novelty evaluations.
  • Proposes a four-dimensional framework (Relevance, Correctness, Coverage, Clarity) for assessing LLM novelty evaluations.
  • Reveals that current LLMs show limited understanding of scientific novelty, and that fine-tuned models often struggle with instruction adherence.

Why it matters

Assessing research novelty is crucial for peer review, but the growing volume of submissions is overwhelming human reviewers. This paper provides a critical tool, NovBench, for systematically evaluating LLMs on this task. Its findings highlight significant gaps in current LLM capabilities, guiding future research toward more effective AI-assisted peer review systems.

Original Abstract

Novelty is a core requirement in academic publishing and a central focus of peer review, yet the growing volume of submissions has placed increasing pressure on human reviewers. While large language models (LLMs), including those fine-tuned on peer review data, have shown promise in generating review comments, the absence of a dedicated benchmark has limited systematic evaluation of their ability to assess research novelty. To address this gap, we introduce NovBench, the first large-scale benchmark designed to evaluate LLMs' capability to generate novelty evaluations in support of human peer review. NovBench comprises 1,684 paper-review pairs from a leading NLP conference, including novelty descriptions extracted from paper introductions and corresponding expert-written novelty evaluations. We focus on both sources because the introduction provides a standardized and explicit articulation of novelty claims, while expert-written novelty evaluations constitute one of the current gold standards of human judgment. Furthermore, we propose a four-dimensional evaluation framework (including Relevance, Correctness, Coverage, and Clarity) to assess the quality of LLM-generated novelty evaluations. Extensive experiments on both general and specialized LLMs under different prompting strategies reveal that current models exhibit limited understanding of scientific novelty, and that fine-tuned models often suffer from instruction-following deficiencies. These findings underscore the need for targeted fine-tuning strategies that jointly improve novelty comprehension and instruction adherence.
