ArXiv TLDR

RealVuln: Benchmarking Rule-Based, General-Purpose LLM, and Security-Specialized Scanners on Real-World Code

arXiv:2604.13764

John Pellew, Faizan Raza

cs.CR

TLDR

RealVuln introduces an open-source benchmark comparing rule-based, general-purpose LLM, and security-specialized scanners on real-world Python code.

Key contributions

  • RealVuln is the first open-source benchmark for security scanners on real-world Python code.
  • Compares 15 scanners across Rule-Based SAST, General-Purpose LLMs, and Security-Specialized categories.
  • Finds a three-tier F3 ranking: Security-Specialized scanners lead, followed by General-Purpose LLMs, then Rule-Based tools.
  • All benchmark code, ground-truth data, and scoring scripts are open-source and community-driven.
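The F3 score used for the ranking is the standard F-beta measure with beta=3, which weights recall beta² = 9 times more than precision. A minimal sketch of the metric (this helper is illustrative, not taken from the RealVuln scoring scripts):

```python
def f_beta(precision: float, recall: float, beta: float = 3.0) -> float:
    """Weighted harmonic mean of precision and recall.

    beta > 1 favors recall; beta = 3 weights recall beta**2 = 9x
    over precision, as RealVuln's F3 ranking does.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# With beta=3, a recall-heavy scanner outscores a precision-heavy one
# even when their F1 scores would be identical:
recall_heavy = f_beta(precision=0.2, recall=0.9)   # ~0.667
precision_heavy = f_beta(precision=0.9, recall=0.2)  # ~0.217
```

This recall weighting reflects the security setting: a missed vulnerability (false negative) is typically far more costly than a false alarm, which explains why rankings shift between F1 and F3 in the paper's results.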

Why it matters

This paper provides a crucial, open-source benchmark for evaluating security scanners on real-world vulnerabilities. Its findings highlight the superior performance of specialized tools and LLMs over traditional rule-based methods, guiding future development and adoption in software security.

Original Abstract

How do security scanners perform on real-world code? We present RealVuln, the first open-source benchmark comparing Rule-Based SAST, General-Purpose LLMs, and Security-Specialized scanners on 26 intentionally vulnerable Python repositories (educational and Capture-The-Flag applications) with 796 hand-labeled entries (676 vulnerabilities, 120 false-positive traps). We test 15 scanners (3 Rule-Based SAST, 10 General-Purpose LLM, 2 Security-Specialized) and rank them by F3 score (beta=3, weighting recall 9x over precision). A clear three-tier ranking emerges under all metrics. Under F3, the Security-Specialized scanner Kolega.Dev (73.0) leads, followed by the best General-Purpose LLM, Claude Sonnet 4.6 (51.7), which in turn scores nearly 3x higher than the best Rule-Based tool, Semgrep (17.7). Under F1, Sonnet 4.6 leads (60.9) with Kolega.Dev at 52.4. Rankings within tiers shift with beta, but the three-tier hierarchy holds across all weightings. All code, ground-truth data, scanner outputs, and scoring scripts are released under an open-source license. An interactive dashboard is at https://realvuln.kolega.dev/. RealVuln is a living benchmark: versioned, community-driven, with a roadmap toward multi-language coverage.
