
Cochise: A Reference Harness for Autonomous Penetration Testing

2605.11671

Andreas Happe, Jürgen Cito

cs.CR, cs.AI, cs.SE

TLDR

Cochise is a minimal Python reference harness for LLM-driven autonomous penetration testing, providing reusable infrastructure for research and comparison.

Key contributions

  • Introduces Cochise, a 597 LOC Python reference harness for LLM-driven autonomous penetration testing experiments.
  • Features a Planner-Executor architecture with external state and a ReAct-style executor for SSH command execution.
  • Includes replay and analysis tools, plus a corpus of JSON trajectory logs from GOAD runs for offline study.
  • Evaluated against the Game of Active Directory (GOAD) testbed, demonstrating its efficacy as a minimal harness.
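The Planner-Executor split described in the contributions above can be illustrated with a minimal Python sketch. This is not Cochise's code: `PlanState`, `execute_task`, `run_harness`, and the stub callables are hypothetical names chosen for illustration, and the SSH session is replaced by a plain function so the example runs without a network. It only shows the general shape, assuming the planner's state lives in an ordinary object outside the model context and the executor alternates between issuing a command and reading its output.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class PlanState:
    """Long-term state kept outside any LLM context (hypothetical structure)."""
    tasks: List[str] = field(default_factory=list)
    findings: List[str] = field(default_factory=list)


# Hypothetical interfaces, not taken from the paper:
# the executor model maps (task, history) to ("run", command) or ("done", summary).
ExecutorLLM = Callable[[str, List[Tuple[str, str]]], Tuple[str, str]]
RunCommand = Callable[[str], str]  # stand-in for a command sent over SSH
Planner = Callable[[PlanState], Optional[str]]


def execute_task(task: str, llm: ExecutorLLM, run: RunCommand, max_steps: int = 5) -> str:
    """ReAct-style loop: choose an action, observe the output, repeat."""
    history: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        action, arg = llm(task, history)
        if action == "done":
            return arg
        # Feed the command output back so the next step can self-correct.
        history.append((arg, run(arg)))
    return "step budget exhausted"


def run_harness(state: PlanState, plan_next: Planner, llm: ExecutorLLM,
                run: RunCommand, max_rounds: int = 10) -> PlanState:
    """Planner picks the next task from external state; executor carries it out."""
    for _ in range(max_rounds):
        task = plan_next(state)
        if task is None:
            break
        state.tasks.append(task)
        state.findings.append(execute_task(task, llm, run))
    return state


# Stubs standing in for the LLM calls and the SSH connection.
def fake_run(cmd: str) -> str:
    return f"output of {cmd}"


def fake_llm(task: str, history: List[Tuple[str, str]]) -> Tuple[str, str]:
    if not history:
        return ("run", "whoami")
    return ("done", f"{task}: saw {history[-1][1]}")


def fake_planner(state: PlanState) -> Optional[str]:
    return "enumerate users" if not state.tasks else None


final_state = run_harness(PlanState(), fake_planner, fake_llm, fake_run)
print(final_state.findings)
```

Keeping `PlanState` as a plain object means each executor call starts with only its task and its own short history, which is one way to read "long-term state is maintained outside the LLM context".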

Why it matters

Existing LLM-driven penetration-testing systems bundle many architectural, prompting, and tool-integration choices, which makes it hard to tell what they gain over a simple agent scaffold. Cochise offers a minimal, reproducible baseline for comparing models and agent architectures. The released replay tools and GOAD trajectory logs also let researchers study agent behavior without provisioning the 48-64 GB RAM / 190 GB storage testbed themselves.

Original Abstract

Recent work on LLM-driven autonomous penetration testing reports promising results, but existing systems often combine many architectural, prompting, and tool-integration choices, making it difficult to tell what is gained over a simple agent scaffold. We present cochise, a 597 LOC Python reference harness for autonomous penetration-testing experiments. Cochise connects an LLM-driven agent to a Linux execution host over SSH and supports controlled target environments reachable from that jump host. The prototype implements a separated Planner-Executor architecture in which long-term state is maintained outside the LLM context, while a ReAct-style executor issues commands over SSH and self-corrects based on command outputs. The scenario prompt can be adapted to different target environments. To demonstrate the efficacy of our minimal harness, we evaluate it against a live third-party testbed called Game of Active Directory (GOAD). Alongside the harness, we release replay and analysis tools: (i) cochise-replay for offline visualization of captured runs, (ii) cochise-analyze-alogs and cochise-analyze-graphs for cost, token, duration, and compromise analysis, and (iii) a corpus of JSON trajectory logs from GOAD runs, allowing researchers to study agent behavior without provisioning the 48-64 GB RAM / 190 GB storage testbed themselves. Cochise is intended not as a state-of-the-art pen-testing agent, but as reusable experimental infrastructure for comparing models, agent architectures, and penetration-testing traces.
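As a rough illustration of the offline cost, token, and duration analysis the abstract mentions, the sketch below totals per-event fields from a JSON Lines log. The schema is an assumption made for this example: the field names `prompt_tokens`, `completion_tokens`, `cost_usd`, and `duration_s` and the function `summarize_run` are hypothetical and are not taken from the released Cochise logs or tools, so they would need to be adapted to the actual log format.

```python
import json
from typing import Dict, Iterable

# Hypothetical field names; the real trajectory logs may use a different schema.
SUMMED_FIELDS = ("prompt_tokens", "completion_tokens", "cost_usd", "duration_s")


def summarize_run(lines: Iterable[str]) -> Dict[str, float]:
    """Sum token, cost, and duration fields over one run's JSON Lines events."""
    totals: Dict[str, float] = {"events": 0}
    for name in SUMMED_FIELDS:
        totals[name] = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between events
        event = json.loads(line)
        totals["events"] += 1
        for name in SUMMED_FIELDS:
            # Events without a given field (e.g. pure command output) count as zero.
            totals[name] += event.get(name, 0)
    return totals


# Two made-up events in the assumed schema.
sample_log = [
    '{"prompt_tokens": 100, "completion_tokens": 20, "cost_usd": 0.5, "duration_s": 1.5}',
    '{"prompt_tokens": 50, "completion_tokens": 10, "cost_usd": 0.25, "duration_s": 2.5}',
]
summary = summarize_run(sample_log)
print(summary)
```

With a real log, the same function could be fed an open file handle, since iterating a text file yields one line at a time.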
