AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance
TLDR
AISafetyBenchExplorer catalogues 195 AI safety benchmarks, revealing fragmented measurement, weak governance, and a lack of standardization in LLM safety evaluation.
Key contributions
- Introduces AISafetyBenchExplorer, a structured catalogue of 195 AI safety benchmarks (2018-2026).
- Utilizes a multi-sheet schema to detail benchmark, metric, paper, and repository metadata (see the schema sketch after this list).
- Reveals widespread fragmentation, weak governance, and lack of standardization in safety measurement.
- Highlights issues like English-only focus, stale repositories, and inconsistent metric definitions.
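To make the multi-sheet design concrete, here is a minimal sketch of the four record types the schema describes, one per sheet. All field names are illustrative assumptions, not the workbook's actual column headers:

```python
from dataclasses import dataclass


@dataclass
class Benchmark:
    """One row of the benchmark-level sheet (fields are illustrative)."""
    name: str
    release_year: int       # 2018-2026 per the catalogue
    languages: list[str]    # e.g. ["en"]; most benchmarks are English-only
    complexity_tier: str    # e.g. "medium" or "popular"
    evaluation_only: bool   # resource ships no training split


@dataclass
class Metric:
    """One row of the metric-level sheet."""
    benchmark: str          # key into the benchmark sheet
    label: str              # e.g. "accuracy", "F1 score", "safety score"
    judge: str              # human, rule-based, or LLM judge
    aggregation: str        # how per-item scores are combined


@dataclass
class PaperRecord:
    """One row of the benchmark-paper sheet."""
    benchmark: str
    venue: str              # often an arXiv preprint


@dataclass
class RepositoryRecord:
    """One row of the repository-activity sheet."""
    benchmark: str
    platform: str           # "GitHub" or "Hugging Face"
    last_updated: str       # used to flag stale repositories
```

Keying the metric, paper, and repository sheets back to the benchmark sheet is what lets the catalogue ask not just which benchmarks exist, but how each one is judged, published, and maintained.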
Why it matters
This paper tackles fragmentation in AI safety benchmark evaluation: its comprehensive catalogue reveals a lack of measurement standardization and weak governance across existing benchmarks. The resource helps researchers discover, compare, and meta-evaluate benchmarks, promoting a more coherent and rigorous safety ecosystem.
Original Abstract
The rapid expansion of large language model (LLM) safety evaluation has produced a substantial benchmark ecosystem, but not a correspondingly coherent measurement ecosystem. We present AISafetyBenchExplorer, a structured catalogue of 195 AI safety benchmarks released between 2018 and 2026, organized through a multi-sheet schema that records benchmark-level metadata, metric-level definitions, benchmark-paper metadata, and repository activity. This design enables meta-analysis not only of what benchmarks exist, but also of how safety is operationalized, aggregated, and judged across the literature. Using the updated catalogue, we identify a central structural problem: benchmark proliferation has outpaced measurement standardization. The current landscape is dominated by medium-complexity benchmarks (94/195), while only 7 benchmarks occupy the Popular tier. The workbook further reports strong concentration around English-only evaluation (165/195), evaluation-only resources (170/195), stale GitHub repositories (137/195), stale Hugging Face datasets (96/195), and heavy reliance on arXiv preprints among benchmarks with known venue metadata. At the metric level, the catalogue shows that familiar labels such as accuracy, F1 score, safety score, and aggregate benchmark scores often conceal materially different judges, aggregation rules, and threat models. We argue that the field's main failure mode is fragmentation rather than scarcity. Researchers now have many benchmark artifacts, but they often lack a shared measurement language, a principled basis for benchmark selection, and durable stewardship norms for post-publication maintenance. AISafetyBenchExplorer addresses this gap by providing a traceable benchmark catalogue, a controlled metadata schema, and a complexity taxonomy that together support more rigorous benchmark discovery, comparison, and meta-evaluation.
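The headline fractions in the abstract are exactly the kind of summary the catalogue is built to support. A minimal sketch of how they could be recomputed, assuming a hypothetical flat CSV export ("benchmarks.csv") of the benchmark sheet with illustrative column names that may differ from the workbook's actual schema:

```python
import pandas as pd

# Hypothetical flat export of the benchmark-level sheet; column names
# here are assumptions for illustration, not the workbook's real ones.
df = pd.read_csv("benchmarks.csv")
total = len(df)

headline_counts = {
    "English-only":      (df["languages"] == "en").sum(),            # 165/195
    "evaluation-only":   df["evaluation_only"].sum(),                # 170/195
    "stale GitHub repo": df["github_stale"].sum(),                   # 137/195
    "stale HF dataset":  df["hf_stale"].sum(),                       #  96/195
    "medium complexity": (df["complexity_tier"] == "medium").sum(),  #  94/195
}
for name, count in headline_counts.items():
    print(f"{name}: {count}/{total} ({count / total:.0%})")
```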