ArXiv TLDR

From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

arXiv:2605.10834

Pedro Conde, Henrique Branquinho, Valerio Mazzone, Bruno Mendes, André Baptista + 1 more

cs.AI · cs.CR

TLDR

This paper introduces a new evaluation protocol for AI pentesting agents, shifting assessment from predefined task completion to validated vulnerability discovery in realistic, multi-surface targets.

Key contributions

  • Evaluates AI pentesting agents on validated vulnerability discovery rather than predefined task completion.
  • Assesses agents against complex, real-world targets spanning diverse attack surfaces and vulnerability classes.
  • Scores findings with LLM-based semantic matching plus bipartite resolution, keeping scoring robust under ambiguous or overlapping reports (see the sketch after this list).
  • Adds continuous ground-truth maintenance, repeated evaluation of stochastic agents, and efficiency metrics for sustainable, realistic assessment.
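
The third bullet is the least self-explanatory, so here is a minimal sketch of how semantic matching plus bipartite resolution could be wired together, assuming an LLM-backed similarity(finding, gt) callable that returns a score in [0, 1]. Function names, the threshold, and the return shape are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch: match free-form agent findings to ground-truth
# vulnerabilities one-to-one, then derive counts for scoring.
import numpy as np
from scipy.optimize import linear_sum_assignment

def resolve_matches(findings, ground_truth, similarity, threshold=0.7):
    """similarity(f, g) -> [0, 1] stands in for the LLM-based semantic matcher."""
    if not findings or not ground_truth:
        return [], 0, len(findings), len(ground_truth)
    # Rows are findings, columns are ground-truth entries.
    sim = np.array([[similarity(f, g) for g in ground_truth] for f in findings])
    # Bipartite resolution: a maximum-weight one-to-one assignment stops a
    # single verbose finding from claiming several distinct vulnerabilities.
    rows, cols = linear_sum_assignment(sim, maximize=True)
    matches = [(findings[r], ground_truth[c], sim[r, c])
               for r, c in zip(rows, cols) if sim[r, c] >= threshold]
    tp = len(matches)  # validated discoveries
    return matches, tp, len(findings) - tp, len(ground_truth) - tp
```

The one-to-one constraint is the point of the bipartite step: without it, greedy per-finding matching inflates scores whenever an agent reports the same issue in several phrasings.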

Why it matters

Current AI pentesting benchmarks fail to capture real-world complexity. This new protocol offers a more realistic and operationally informative way to compare agents, crucial for developing effective offensive security AI.

Original Abstract

AI pentesting agents are increasingly credible as offensive security systems, but current benchmarks still provide limited guidance on which will perform best in real-world targets. Existing evaluation protocols assess and optimize for predefined goals such as capture-the-flag, remote code execution, exploit reproduction, or trajectory similarity, in simplified or narrow settings. These tools are valuable for measuring bounded capabilities, yet they do not adequately capture the complexity, open-ended exploration, and strategic decision-making required in realistic pentesting. In this paper, we present a practical evaluation protocol that shifts assessment from task completion to validated vulnerability discovery, allowing evaluation in sufficiently complex targets spanning multiple attack surfaces and vulnerability classes. The protocol combines structured ground-truth with LLM-based semantic matching to identify vulnerabilities, bipartite resolution to score findings under realistic ambiguity, continuous ground-truth maintenance, repeated and cumulative evaluation of stochastic agents, efficiency metrics, and reduced-suite selection for sustainable experimentation. This protocol extends the state of the art by enabling a more realistic, operationally informative comparison of AI pentesting agents. To enable reproducibility, we also release expert-annotated ground truth and code for the proposed evaluation protocol: https://github.com/jd0965199-oss/ethibench.
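
Since agents are stochastic, the abstract calls for repeated and cumulative evaluation; below is a minimal sketch of what those two views could look like, assuming each run yields a set of matched ground-truth IDs (names and granularity are assumptions, not taken from the released code).

```python
# Hypothetical sketch: per-run recall plus cumulative coverage of the
# ground truth across repeated runs of the same agent on one target.
def cumulative_discovery(runs, ground_truth_ids):
    """runs: list of sets of matched ground-truth IDs, one set per repeat."""
    total = len(ground_truth_ids)
    per_run = [len(found & ground_truth_ids) / total for found in runs]
    discovered, cumulative = set(), []
    for found in runs:
        discovered |= found & ground_truth_ids
        cumulative.append(len(discovered) / total)
    return per_run, cumulative

# Three repeats of one agent against a target with four known vulnerabilities.
runs = [{"vuln-1", "vuln-3"}, {"vuln-1"}, {"vuln-2", "vuln-3"}]
gt = {"vuln-1", "vuln-2", "vuln-3", "vuln-4"}
print(cumulative_discovery(runs, gt))
# ([0.5, 0.25, 0.5], [0.5, 0.5, 0.75])
```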
