ArXiv TLDR

CrackMeBench: Binary Reverse Engineering for Agents

arXiv: 2605.10597

Isaac David, Arthur Gervais

cs.SE, cs.AI

TLDR

CrackMeBench is a new benchmark for evaluating language models on binary reverse engineering tasks, focusing on recovering validation logic from executables.

Key contributions

  • Introduces CrackMeBench, a benchmark for evaluating language-model agents on educational CrackMe-style binary reverse-engineering tasks.
  • Focuses on deterministic binary validation with executable oracles, symbol-poor binaries, and explicit local tool access in a no-network sandbox (a minimal sketch of such a validation program follows this list).
  • The v0 benchmark combines 8 public calibration CrackMes with 12 generated main-score tasks built from seeded C, Rust, and Go templates.
  • GPT-5.5 reaches 92% pass@3 (11/12) on the generated split, ahead of Claude Opus 4.7 (58%, 7/12) and Kimi K2 (42%, 5/12).
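
To make the task format concrete, below is a minimal, illustrative sketch of the kind of deterministic validation program a CrackMe-style task centers on. It is not taken from the paper's templates; the check routine, constants, and exit-code convention are assumptions for illustration only. The agent's job would be to recover this logic from the compiled, symbol-poor binary and produce an input the program accepts.

```c
/* Illustrative only: a toy CrackMe-style validator, not a template from the paper.
 * The binary's own accept/reject behavior serves as the deterministic oracle. */
#include <stdio.h>
#include <string.h>

/* Hypothetical check: serial must be 8 characters and its byte sum must be
 * divisible by 251. Any serial meeting that condition is accepted. */
static int check_serial(const char *s) {
    if (strlen(s) != 8)
        return 0;
    unsigned sum = 0;
    for (size_t i = 0; i < 8; i++)
        sum += (unsigned char)s[i];
    return (sum % 251) == 0;
}

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <serial>\n", argv[0]);
        return 2;
    }
    if (check_serial(argv[1])) {
        puts("Access granted");
        return 0;   /* accepted: deterministic success signal */
    }
    puts("Access denied");
    return 1;       /* rejected */
}
```

A submission would then be judged by running the shipped binary on the produced serial rather than by grading a free-form explanation, consistent with the paper's emphasis on executable oracles and externally scored submissions.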

Why it matters

This paper introduces a benchmark for evaluating language models on classical binary reverse engineering, an area that coding-agent and CTF benchmarks leave less precisely specified. CrackMeBench provides a standardized, reproducible testbed for measuring progress from source-code reasoning toward autonomous binary analysis.

Original Abstract

Benchmarks for coding agents increasingly measure source-level software repair, and cybersecurity benchmarks increasingly measure broad capture-the-flag performance. Classical binary reverse engineering remains less precisely specified: given only an executable, can an agent recover validation logic and produce an input, serial, artifact, or key generator accepted by the program? We introduce CrackMeBench, a benchmark for evaluating language-model agents on educational CrackMe-style reverse-engineering tasks. CrackMeBench focuses on deterministic binary validation problems with executable oracles, symbol-poor binaries, explicit local tool access, and externally scored submissions rather than free-form explanations. The v0 benchmark combines eight public calibration CrackMes with twelve generated main-score tasks built from seeded C, Rust, and Go templates, and agents run through an equal shell interface in a no-network Linux Docker sandbox with standard reverse-engineering tools. In a three-model evaluation with a five-minute budget and three scored submissions per task, pass@3 on the generated split is 11/12 tasks (92%) for GPT-5.5, 7/12 (58%) for Claude Opus 4.7, and 5/12 (42%) for Kimi K2. The harder generated half separates the models more sharply, with pass@3 of 5/6, 2/6, and 1/6, respectively; on the eight-task public calibration split, pass@3 is 3/8, 2/8, and 1/8. CrackMeBench records pass@1 and pass@3, scored submissions, wall-clock time, command traces, tool categories, provider-reported token usage, estimated cost, and qualitative failure labels, providing a reproducible testbed for measuring progress from source-code reasoning toward autonomous binary analysis while restricting scope to educational, purpose-built programs.
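
The abstract's "executable oracles" and "externally scored submissions" amount to: run the shipped binary on whatever the agent produced and accept or reject based on the program's own deterministic behavior. The sketch below illustrates one plausible way such a check could be wired up; the command-line interface, paths, and accept-on-exit-0 convention are assumptions, not the benchmark's actual harness.

```c
/* Illustrative sketch of externally scoring a submission against an executable
 * oracle: run the task binary on the candidate input and trust only its exit
 * status. Paths, arguments, and the exit-code convention are assumptions. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the oracle binary accepts the candidate, 0 otherwise. */
static int score_submission(const char *oracle_path, const char *candidate) {
    pid_t pid = fork();
    if (pid < 0)
        return 0;
    if (pid == 0) {
        /* Child: exec the task binary with the agent's candidate as argv[1]. */
        execl(oracle_path, oracle_path, candidate, (char *)NULL);
        _exit(127);  /* exec failed */
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return 0;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "usage: %s <oracle-binary> <candidate>\n", argv[0]);
        return 2;
    }
    printf("%s\n", score_submission(argv[1], argv[2]) ? "PASS" : "FAIL");
    return 0;
}
```

Under the paper's protocol of three scored submissions per task, pass@3 counts a task as solved if any of those submissions is accepted by the oracle.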
