CyberCertBench: Evaluating LLMs in Cybersecurity Certification Knowledge
Gustav Keppler, Ghada Elbez, Veit Hagenmeyer
TLDR
CyberCertBench evaluates LLMs in cybersecurity certification knowledge, showing frontier models excel in general IT security but struggle with vendor-specific details.
Key contributions
- Introduces CyberCertBench, a new benchmark for evaluating LLMs on cybersecurity certification knowledge.
- Evaluates LLMs across IT, OT, and specialized cybersecurity standards using multiple-choice questions.
- Proposes a Proposer-Verifier framework for generating interpretable explanations of LLM performance.
- Finds frontier LLMs excel in general IT security but struggle with vendor-specific or formal standards like IEC 62443.
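The contributions above center on multiple-choice question answering (MCQA), where a model's picked answer letter is compared against a gold key. A minimal sketch of how such accuracy is typically scored — the question IDs, field layout, and scoring function here are hypothetical illustrations, not CyberCertBench's actual schema:

```python
def score_mcqa(predictions: dict, answer_key: dict) -> float:
    """Fraction of questions where the predicted letter matches the gold letter.

    Comparison is case-insensitive; unanswered questions count as wrong.
    """
    if not answer_key:
        raise ValueError("answer key must not be empty")
    correct = sum(
        1
        for qid, gold in answer_key.items()
        if predictions.get(qid, "").strip().upper() == gold.upper()
    )
    return correct / len(answer_key)

# Example: three questions, two answered correctly (q3 is wrong).
key = {"q1": "A", "q2": "C", "q3": "B"}
preds = {"q1": "a", "q2": "C", "q3": "D"}
print(score_mcqa(preds, key))
```

Per-certification accuracies computed this way are what allow the paper to contrast general IT security performance with specialized standards such as IEC 62443.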
Why it matters
This paper provides a benchmark for assessing LLMs' practical cybersecurity knowledge against recognized industry certifications, making their strengths and weaknesses measurable. It shows that while frontier LLMs handle general IT security well, they fall short on specialized, vendor-specific, or formal standards such as IEC 62443, which informs how much these models can be trusted in critical cybersecurity applications.
Original Abstract
The rapid evolution and use of Large Language Models (LLMs) in professional workflows require an evaluation of their domain-specific knowledge against industry standards. We introduce CyberCertBench, a new suite of Multiple Choice Question Answering (MCQA) benchmarks derived from industry recognized certifications. CyberCertBench evaluates LLM domain knowledge against the professional standards of Information Technology cybersecurity and more specialized areas such as Operational Technology and related cybersecurity standards. Concurrently, we propose and validate a novel Proposer-Verifier framework, a methodology to generate interpretable, natural language explanations for model performance. Our evaluation shows that frontier models achieve human expert level in general networking and IT security knowledge. However, their accuracy declines in questions that require vendor-specific nuances or knowledge in formal standards, such as IEC 62443. Analysis of model scaling trends and release dates demonstrates remarkable gains in parameter efficiency, while recent larger models show diminishing returns. Code and evaluation scripts are available at: https://github.com/GKeppler/CyberCertBench.