Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination

April 27, 2026 | arXiv:2604.24690

Lirong Gao, Zeqing Wang, Yuyan Cai, Jiayi Deng, Yanmei Gu + 4 more

cs.CL

TLDR

A new benchmark, ProHist-Bench, reveals state-of-the-art LLMs struggle with complex historical reasoning, highlighting a significant proficiency gap.

Key contributions

Developed ProHist-Bench, a novel benchmark for evaluating LLMs' historical reasoning via Chinese Imperial Examinations.
Curated 400 challenging, expert-designed historical questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics.
Evaluated 18 LLMs, revealing a significant proficiency gap in their ability to handle complex historical research tasks.

Why it matters

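This paper introduces a benchmark that assesses LLMs' historical reasoning beyond basic knowledge breadth and lexical understanding, targeting higher-order skills such as evidentiary reasoning. It shows that even state-of-the-art models struggle with complex historical research questions, which the authors hope will motivate domain-specific reasoning models for historical research.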

Original Abstract

While Large Language Models (LLMs) have increasingly assisted in historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess basic knowledge breadth or lexical understanding, failing to capture the higher-order skills, such as evidentiary reasoning, that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system, a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant proficiency gap: even state-of-the-art LLMs struggle with complex historical research questions. We hope ProHist-Bench will facilitate the development of domain-specific reasoning LLMs, advance computational historical research, and further uncover the untapped potential of LLMs. We release ProHist-Bench at https://github.com/inclusionAI/ABench/tree/main/ProHist-Bench.

