ArXiv TLDR

CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios

2605.07830

Taein Lim, Seongyong Ju, Munhyeok Kim, Hyunjun Kim, Hoki Kim

cs.CR cs.AI

TLDR

CyBiasBench reveals that LLM cyber-attack agents exhibit inherent attack-selection biases, concentrating their efforts on a narrow set of attack families regardless of prompt variations.

Key contributions

  • Introduced CyBiasBench, a 630-session benchmark to quantify attack-selection bias in LLM cyber agents.
  • Found explicit, agent-specific biases in attack family allocation, independent of attack success rates.
  • Identified a "bias momentum effect": agents resist explicit steering toward attack families that conflict with their bias, and forced shifts do not improve attack performance.
  • Released an interactive dashboard and reproducibility artifact to facilitate future research.

Why it matters

This paper highlights a critical issue: LLM cyber agents possess inherent attack biases that are hard to mitigate. Understanding these biases is crucial for developing more reliable and controllable AI agents in cybersecurity, preventing unintended attack patterns and improving agent safety.

Original Abstract

Large language models (LLMs) are increasingly deployed as autonomous agents in offensive cybersecurity. In this paper, we reveal an interesting phenomenon: different agents exhibit distinct attack patterns. Specifically, each agent exhibits an attack-selection bias, disproportionately concentrating its efforts on a narrow subset of attack families regardless of prompt variations. To systematically quantify this behavior, we introduce CyBiasBench, a comprehensive 630-session benchmark that evaluates five agents on three targets and four prompt conditions with ten attack families. We identify explicit bias across agents, with different dominant attack families and varying entropy levels in their attack-family allocation distributions. Such bias is better characterized as a trait of the agents, rather than a factor associated with the attack success rate. Furthermore, our experiments reveal a bias momentum effect, where agents resist explicit steering toward attack families that conflict with their bias. This forced distribution shift does not yield measurable improvements in attack performance. To ensure reproducibility and facilitate future research, we release an interactive result dashboard at https://trustworthyai.co.kr/CyBiasBench/ and a reproducibility artifact with aggregated session-level statistics and full evaluation scripts at https://github.com/Harry24k/CyBiasBench.
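The abstract characterizes bias via the entropy of each agent's attack-family allocation distribution: the more an agent concentrates on a few families, the lower the entropy. A minimal sketch of that measurement using Shannon entropy (the attack-family labels and session logs below are hypothetical, not from the benchmark):

```python
from collections import Counter
import math

def allocation_entropy(attack_choices):
    """Shannon entropy (in bits) of an agent's attack-family allocation.

    attack_choices: list of attack-family labels, one per session.
    Lower entropy = stronger concentration on few families (stronger bias);
    the maximum for k families is log2(k), reached at a uniform allocation.
    """
    counts = Counter(attack_choices)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical session logs for two agents.
biased_agent = ["sql_injection"] * 8 + ["xss"] * 2
broad_agent = ["sql_injection", "xss", "path_traversal", "csrf", "ssrf"] * 2

print(allocation_entropy(biased_agent))  # ~0.72 bits: heavily skewed
print(allocation_entropy(broad_agent))   # log2(5) ~ 2.32 bits: uniform
```

Comparing such per-agent entropy values across prompt conditions is one way to see the paper's claim that bias is a stable trait of the agent rather than an artifact of the prompt.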
