Can LLMs Act as Historians? Evaluating Historical Research Capabilities of LLMs via the Chinese Imperial Examination
Lirong Gao, Zeqing Wang, Yuyan Cai, Jiayi Deng, Yanmei Gu + 4 more
TLDR
A new benchmark, ProHist-Bench, shows that even state-of-the-art LLMs struggle with complex historical research questions, revealing a significant proficiency gap in professional-level historical reasoning.
Key contributions
- Developed ProHist-Bench, a novel benchmark for evaluating LLMs' historical reasoning via Chinese Imperial Examinations.
- Curated 400 challenging, expert-designed historical questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics.
- Evaluated 18 LLMs, revealing a significant proficiency gap in their ability to handle complex historical research tasks.
Why it matters
Existing benchmarks mostly test breadth of basic knowledge or lexical understanding. ProHist-Bench instead targets the higher-order skills central to historical research, such as evidentiary reasoning. The gap it exposes in current models points to where domain-specific reasoning LLMs for historical research still need work.
Original Abstract
While Large Language Models (LLMs) have increasingly assisted in historical tasks such as text processing, their capacity for professional-level historical reasoning remains underexplored. Existing benchmarks primarily assess basic knowledge breadth or lexical understanding, failing to capture the higher-order skills, such as evidentiary reasoning, that are central to historical research. To fill this gap, we introduce ProHist-Bench, a novel benchmark anchored in the Chinese Imperial Examination (Keju) system, a comprehensive microcosm of East Asian political, social, and intellectual history spanning over 1,300 years. Developed through deep interdisciplinary collaboration, ProHist-Bench features 400 challenging, expert-curated questions across eight dynasties, accompanied by 10,891 fine-grained evaluation rubrics. Through a rigorous evaluation of 18 LLMs, we reveal a significant proficiency gap: even state-of-the-art LLMs struggle with complex historical research questions. We hope ProHist-Bench will facilitate the development of domain-specific reasoning LLMs, advance computational historical research, and further uncover the untapped potential of LLMs. We release ProHist-Bench at https://github.com/inclusionAI/ABench/tree/main/ProHist-Bench.