CyberCertBench: Evaluating LLMs in Cybersecurity Certification Knowledge
Gustav Keppler, Ghada Elbez, Veit Hagenmeyer
TLDR
CyberCertBench evaluates LLMs in cybersecurity certification knowledge, showing frontier models excel in general IT security but struggle with vendor-specific details.
Key contributions
- Introduces CyberCertBench, a new benchmark for evaluating LLMs on cybersecurity certification knowledge.
- Evaluates LLMs across IT, OT, and specialized cybersecurity standards using multiple-choice questions.
- Proposes a Proposer-Verifier framework for generating interpretable explanations of LLM performance.
- Finds frontier LLMs excel in general IT security but struggle with vendor-specific or formal standards like IEC 62443.
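The contributions above center on multiple-choice question answering (MCQA), where a model's picked answer letter is compared against a gold key. A minimal sketch of how such accuracy is typically scored — the question IDs, field layout, and scoring function here are hypothetical illustrations, not CyberCertBench's actual schema:

```python
def score_mcqa(predictions: dict, answer_key: dict) -> float:
    """Fraction of questions where the predicted letter matches the gold letter.

    Comparison is case-insensitive; unanswered questions count as wrong.
    """
    if not answer_key:
        raise ValueError("answer key must not be empty")
    correct = sum(
        1
        for qid, gold in answer_key.items()
        if predictions.get(qid, "").strip().upper() == gold.upper()
    )
    return correct / len(answer_key)

# Example: three questions, two answered correctly (q3 is wrong).
key = {"q1": "A", "q2": "C", "q3": "B"}
preds = {"q1": "a", "q2": "C", "q3": "D"}
print(score_mcqa(preds, key))
```

Per-certification accuracies computed this way are what allow the paper to contrast general IT security performance with specialized standards such as IEC 62443.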
Why it matters
This paper provides a benchmark for assessing LLMs' practical cybersecurity knowledge against recognized industry certifications, making their strengths and weaknesses measurable. It shows that while frontier LLMs handle general IT security well, they fall short on specialized, vendor-specific, or formal standards such as IEC 62443, which informs how much these models can be trusted in critical cybersecurity applications.
Original Abstract
The rapid evolution and use of Large Language Models (LLMs) in professional workflows require an evaluation of their domain-specific knowledge against industry standards. We introduce CyberCertBench, a new suite of Multiple Choice Question Answering (MCQA) benchmarks derived from industry recognized certifications. CyberCertBench evaluates LLM domain knowledge against the professional standards of Information Technology cybersecurity and more specialized areas such as Operational Technology and related cybersecurity standards. Concurrently, we propose and validate a novel Proposer-Verifier framework, a methodology to generate interpretable, natural language explanations for model performance. Our evaluation shows that frontier models achieve human expert level in general networking and IT security knowledge. However, their accuracy declines in questions that require vendor-specific nuances or knowledge in formal standards, such as IEC 62443. Analysis of model scaling trends and release dates demonstrates remarkable gains in parameter efficiency, while recent larger models show diminishing returns. Code and evaluation scripts are available at: https://github.com/GKeppler/CyberCertBench.