
Patch2Vuln: Agentic Reconstruction of Vulnerabilities from Linux Distribution Binary Patches

arXiv:2605.06601

Isaac David, Arthur Gervais

cs.CR, cs.AI

TLDR

Patch2Vuln uses a language-model agent to reconstruct vulnerabilities from Linux distribution binary patches, evaluated on 25 Ubuntu `.deb` package pairs.

Key contributions

  • Introduces Patch2Vuln, an agentic pipeline for reconstructing vulnerabilities from binary patches.
  • Extracts old/new ELF pairs, diffs them, and uses an offline agent for auditing and classification.
  • Evaluated on 25 Ubuntu `.deb` package pairs (20 security updates and five negative controls), localizing a verified security-relevant patch function in 10 of the 20 security pairs.
  • Identifies binary-diff coverage and local behavioral validation as limiting components.
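
The first stage in the list above, collecting old/new ELF pairs, can be illustrated with a minimal Python sketch. It assumes both package versions have already been unpacked into two directory trees, and it pairs files by identical relative path; the function names (`is_elf`, `elf_files`, `pair_changed_elfs`) are hypothetical and are not taken from Patch2Vuln, whose actual extraction logic is not described here.

```python
from pathlib import Path

# Every ELF file begins with these four bytes.
ELF_MAGIC = b"\x7fELF"


def is_elf(path: Path) -> bool:
    """Return True if the file starts with the 4-byte ELF magic."""
    try:
        with path.open("rb") as fh:
            return fh.read(4) == ELF_MAGIC
    except OSError:
        return False


def elf_files(root: Path) -> dict[str, Path]:
    """Map relative path -> full path for each regular ELF file under root."""
    found = {}
    for p in root.rglob("*"):
        # Skip symlinks so each binary is counted once.
        if p.is_file() and not p.is_symlink() and is_elf(p):
            found[p.relative_to(root).as_posix()] = p
    return found


def pair_changed_elfs(old_root, new_root):
    """Return (relative_path, old_path, new_path) for ELF files present in
    both trees whose contents differ. Pairing by identical relative path is
    an assumption of this sketch."""
    old = elf_files(Path(old_root))
    new = elf_files(Path(new_root))
    pairs = []
    for rel in sorted(old.keys() & new.keys()):
        if old[rel].read_bytes() != new[rel].read_bytes():
            pairs.append((rel, old[rel], new[rel]))
    return pairs
```

Pairing by path is the simplest possible choice and would miss binaries whose filename changes between versions, such as a shared library with the version number embedded in its name. Each resulting pair would then be handed to a binary differ.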

Why it matters

This paper introduces Patch2Vuln, an agentic pipeline that reconstructs vulnerabilities directly from binary patches, which matters when binary packages are the most accessible artifacts and source patches or advisory text are not at hand. It shows the potential of language-model agents in binary security analysis and identifies binary-diff coverage and local behavioral validation as the components that most need further work.

Original Abstract

Security updates create a short but important window in which defenders and attackers can compare vulnerable and patched software. Yet in many operational settings, the most accessible artifacts are binary packages rather than source patches or advisory text. This paper asks whether a language-model agent, restricted to local binary-derived evidence, can reconstruct the security meaning of Linux distribution updates. Patch2Vuln is a local, resumable pipeline that extracts old/new ELF pairs, diffs them with Ghidra and Ghidriff, ranks changed functions, builds candidate dossiers, and asks an offline agent to produce a preliminary audit, bounded validation plan, and final audit. We evaluate Patch2Vuln on 25 Ubuntu `.deb` package pairs: 20 security-update pairs and five negative controls, all manually adjudicated against private source-patch and binary-function ground truth. The agent localizes a verified security-relevant patch function in 10 of 20 security pairs and assigns an accepted final root-cause class in 11 of 20. Oracle diagnostics show that six security pairs fail before model reasoning because the binary differ or ranker omits the right function, with one additional context-export miss. A separate bounded validation pass produces two target-level minimized behavioral old/new differentials, both for tcpdump, but no crash, timeout, sanitizer finding, or memory-corruption proof; all five negative controls are classified as unknown and produce no validation differentials. These results support agentic vulnerability reconstruction from binary patches as a useful research target while showing that binary-diff coverage and local behavioral validation remain the limiting components.
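
The counts reported in the abstract can be tallied with a short Python sketch. All input numbers come from the abstract; the `residual_misses` figure is obtained by subtraction and assumes the seven upstream misses all fall among the non-localized pairs, which the abstract implies but does not state. The function name `summarize` and the dictionary keys are hypothetical.

```python
from fractions import Fraction

# Counts as stated in the abstract.
SECURITY_PAIRS = 20
NEGATIVE_CONTROLS = 5
LOCALIZED = 10               # verified security-relevant function localized
ROOT_CAUSE_ACCEPTED = 11     # accepted final root-cause class
DIFF_OR_RANKER_MISSES = 6    # right function omitted by differ or ranker
CONTEXT_EXPORT_MISSES = 1    # additional context-export miss


def summarize():
    """Derive rates and a failure breakdown from the reported counts."""
    upstream = DIFF_OR_RANKER_MISSES + CONTEXT_EXPORT_MISSES
    not_localized = SECURITY_PAIRS - LOCALIZED
    return {
        "total_pairs": SECURITY_PAIRS + NEGATIVE_CONTROLS,
        "localization_rate": Fraction(LOCALIZED, SECURITY_PAIRS),
        "root_cause_rate": Fraction(ROOT_CAUSE_ACCEPTED, SECURITY_PAIRS),
        "upstream_misses": upstream,
        # Assumption: upstream misses are a subset of non-localized pairs.
        "residual_misses": not_localized - upstream,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value}")
```

Under that assumption, seven of the ten non-localized security pairs are lost before the model sees the right function, which is consistent with the abstract's conclusion that binary-diff coverage is a limiting component.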
