ArXiv TLDR

Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps

2604.19533

Alankrit Chona, Igor Kozlov, Ambuj Kumar

cs.CR cs.AI

TLDR

New Cyber Defense Benchmark reveals current LLM agents dramatically fail at open-ended threat hunting, despite strong Q&A security performance.

Key contributions

  • Introduces Cyber Defense Benchmark for evaluating LLM agents in threat hunting.
  • Uses 106 real attack procedures from OTRF, wrapped in a Gymnasium RL environment.
  • Agents query SQLite logs to find malicious event timestamps, scored CTF-style.
  • Five frontier LLMs (Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Gemini 3 Flash, Kimi K2.5) fail dramatically: even the best averages under 4% recall.
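To make the setup above concrete, here is a minimal sketch, with hypothetical table and column names, of the query-and-flag loop: an in-memory SQLite database of event logs, an agent that submits SQL, and CTF-style scoring of flagged timestamps against ground truth. The real environment uses 75,000-135,000 obfuscated Windows event records per episode; this toy uses three rows.

```python
import sqlite3

# Toy event log standing in for one episode's Windows records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, host TEXT, event_id INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("2024-01-01T10:00:00", "ws01", 4624),
     ("2024-01-01T10:05:00", "ws01", 4688),   # malicious process creation
     ("2024-01-01T10:06:00", "ws02", 4624)],
)

# Sigma-rule-derived ground truth: timestamps of malicious events.
ground_truth = {"2024-01-01T10:05:00"}

# The agent iteratively submits SQL, then explicitly flags suspects.
rows = conn.execute("SELECT ts FROM events WHERE event_id = 4688").fetchall()
flags = {ts for (ts,) in rows}

# CTF-style scoring: recall over ground-truth malicious timestamps.
recall = len(flags & ground_truth) / len(ground_truth)
print(f"recall = {recall:.1%}")  # → recall = 100.0%
```

The open-ended difficulty comes from the agent having no hints about which of the tens of thousands of records are malicious; this sketch hard-codes a lucky query.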

Why it matters

This paper introduces a benchmark for a core SecOps task: open-ended, evidence-driven threat hunting over raw logs. It shows that despite strong scores on curated security Q&A, current frontier LLMs fall far short of the bar for unsupervised deployment in security operations, exposing a gap that question-answering benchmarks do not measure.

Original Abstract

We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus - spanning 86 MITRE ATT&CK sub-techniques across 12 tactics - into a Gymnasium reinforcement-learning environment. Each episode presents the agent with an in-memory SQLite database of 75,000-135,000 log records produced by a deterministic campaign simulator that time-shifts and entity-obfuscates the raw recordings. The agent must iteratively submit SQL queries to discover malicious event timestamps and explicitly flag them, scored CTF-style against Sigma-rule-derived ground truth. Evaluating five frontier models - Claude Opus 4.6, GPT-5, Gemini 3.1 Pro, Kimi K2.5, and Gemini 3 Flash - on 26 campaigns covering 105 of 106 procedures, we find that all models fail dramatically: the best model (Claude Opus 4.6) submits correct flags for only 3.8% of malicious events on average, and no run across any model ever finds all flags. We define a passing score as >= 50% recall on every ATT&CK tactic - the minimum bar for unsupervised SOC deployment. No model passes: the leader clears this bar on 5 of 13 tactics and the remaining four on zero. These results suggest that current LLMs are poorly suited for open-ended, evidence-driven threat hunting despite strong performance on curated Q&A security benchmarks.
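The passing rule in the abstract is a per-tactic minimum, not an overall average: a model must reach >= 50% recall on every ATT&CK tactic. A minimal sketch, with hypothetical recall numbers (not the paper's results), shows why one weak tactic fails the whole model:

```python
# Hypothetical per-tactic recall table illustrating the passing rule.
per_tactic_recall = {
    "initial-access": 0.60,
    "execution": 0.55,
    "persistence": 0.10,   # one weak tactic is enough to fail
    "defense-evasion": 0.70,
}

PASS_BAR = 0.5  # >= 50% recall required on every tactic

passes = all(r >= PASS_BAR for r in per_tactic_recall.values())
cleared = sum(r >= PASS_BAR for r in per_tactic_recall.values())
print(f"tactics cleared: {cleared}/{len(per_tactic_recall)}, pass: {passes}")
# → tactics cleared: 3/4, pass: False
```

Under this rule the reported leader, clearing the bar on only 5 of 13 tactics, fails despite outperforming the other four models everywhere.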
