LitXBench: A Benchmark for Extracting Experiments from Scientific Literature
TLDR
LitXBench is a new benchmark for extracting experimental data from scientific literature; on it, frontier LLMs outperform existing extraction pipelines by up to 0.37 F1.
Key contributions
- Introduces LitXBench, a framework for benchmarking experiment extraction from scientific literature.
- Presents LitXAlloy, a dense benchmark with 1426 measurements from 19 alloy papers.
- Stores benchmark entries as Python objects for improved auditability and programmatic data validation.
- Finds frontier LLMs (e.g., Gemini 3.1 Pro) outperform existing extraction pipelines by up to 0.37 F1.
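The "Python objects instead of CSV/JSON" contribution can be sketched as a typed dataclass whose constructor rejects malformed entries. This is an illustrative assumption, not the actual LitXBench schema: the class name, fields, and validation rules below are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical sketch of a benchmark entry as a validated Python object.
# Unlike a raw CSV/JSON row, the constructor enforces invariants at load
# time, which supports the auditability/validation claim above.
@dataclass(frozen=True)
class Measurement:
    composition: dict[str, float]       # element -> atomic fraction
    processing_steps: tuple[str, ...]   # e.g. ("anneal 900C 2h", "quench")
    property_name: str
    value: float
    unit: str

    def __post_init__(self):
        total = sum(self.composition.values())
        if abs(total - 1.0) > 1e-6:
            raise ValueError(f"composition fractions sum to {total}, not 1")
        if not self.processing_steps:
            raise ValueError("a material is defined by its processing steps")

m = Measurement(
    composition={"Fe": 0.7, "Ni": 0.3},
    processing_steps=("anneal 900C 2h",),
    property_name="yield strength",
    value=450.0,
    unit="MPa",
)
print(m.property_name, m.value, m.unit)
```

Keeping `processing_steps` mandatory reflects the paper's point that processing, not composition alone, defines a material.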
Why it matters
Aggregated experimental data is the raw material for property prediction models, so a reliable benchmark for extraction quality is a key step for materials science. By showing that frontier language models clearly beat purpose-built extraction pipelines on this task, the paper points toward more accurate property prediction and faster scientific discovery.
Original Abstract
Aggregating experimental data from papers enables materials scientists to build better property prediction models and to facilitate scientific discovery. Recently, interest has grown in extracting not only single material properties but also entire experimental measurements. To support this shift, we introduce LitXBench, a framework for benchmarking methods that extract experiments from literature. We also present LitXAlloy, a dense benchmark comprising 1426 total measurements from 19 alloy papers. By storing the benchmark's entries as Python objects, rather than text-based formats such as CSV or JSON, we improve auditability and enable programmatic data validation. We find that frontier language models, such as Gemini 3.1 Pro Preview, outperform existing multi-turn extraction pipelines by up to 0.37 F1. Our results suggest that this performance gap arises because extraction pipelines associate measurements with compositions rather than the processing steps that define a material.
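One way to make the reported F1 gap concrete is to score extraction as set matching between predicted and gold measurements. This is a hedged sketch, not the paper's scoring code: it assumes exact-match equality on (material, property, value) triples, whereas the actual LitXBench matching criteria are likely more nuanced.

```python
# Illustrative extraction F1: each measurement is a hashable triple,
# and a prediction counts as a true positive only on exact match.
def extraction_f1(predicted: set, gold: set) -> float:
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    precision = tp / len(predicted)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {("Fe0.7Ni0.3", "yield strength", 450.0),
        ("Fe0.7Ni0.3", "hardness", 2.1)}
pred = {("Fe0.7Ni0.3", "yield strength", 450.0),
        ("Fe0.7Ni0.3", "hardness", 1.9)}  # wrong value -> no match
print(extraction_f1(pred, gold))  # 0.5: one of two measurements matched
```

Under a metric like this, a pipeline that anchors measurements to the wrong material identity (composition instead of processing) loses both precision and recall, which is consistent with the gap the abstract describes.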