Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study
Nils Loose, Joseph Bienhüls, Kristoffer Hempel, Felix Mächtle, Thomas Eisenbarth
TLDR
This study finds that code language models struggle to detect vulnerability-fixing commits once commit messages are removed, suggesting they do not acquire transferable security understanding from code changes alone.
Key contributions
- Presents a unified benchmark consolidating 20+ datasets (180k+ commits) for VFC detection.
- Finds code language models don't acquire transferable security understanding from code changes alone.
- Commit messages dominate model attention; enriching diffs doesn't shift focus to code changes.
- Code-only models miss over 93% of vulnerabilities at a 0.5% false positive rate.
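The 0.5% false-positive-rate figure in the last bullet corresponds to measuring recall at a fixed FPR along the ROC curve. A minimal sketch of how such an operating point can be computed, using purely synthetic scores (not the paper's models or data):

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic example: y_true marks vulnerability-fixing commits (1) vs.
# ordinary commits (0); y_score is a hypothetical classifier confidence.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=10_000)
y_score = y_true * 0.8 + rng.normal(0.0, 0.3, size=10_000)

fpr, tpr, thresholds = roc_curve(y_true, y_score)

def recall_at_fpr(fpr, tpr, target=0.005):
    """Largest recall (TPR) achievable while keeping FPR <= target."""
    mask = fpr <= target
    return tpr[mask].max() if mask.any() else 0.0

r = recall_at_fpr(fpr, tpr)
print(f"recall at 0.5% FPR: {r:.3f}  (miss rate: {1 - r:.3f})")
```

On real, highly imbalanced commit streams this operating point is strict: even a small FPR translates into many false alarms, which is why the reported >93% miss rate for code-only models is so limiting in practice.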
Why it matters
Timely detection of vulnerability-fixing commits is critical for security, as advisories often lag behind the patches themselves. This paper exposes fundamental limitations of code-centric models, showing they struggle once commit messages are removed, and provides a unified benchmark and empirical insights to guide future VFC detection research.
Original Abstract
Automated detection of vulnerability-fixing commits (VFCs) is critical for timely security patch deployment, as advisory databases lag patch releases by a median of 25 days and many fixes never receive advisories. We present a comprehensive evaluation of code language model based VFC detection through a unified framework consolidating over 20 fragmented datasets spanning more than 180,000 commits. Across over 180 experiments with fine-tuned models from 125M to 14B parameters, we find no evidence that models acquire transferable security-relevant code understanding from code changes alone. When commit messages are available, they dominate model attention, and when removed, an attribution analysis shows that enriching diffs with additional intra-procedural semantic context does not shift model attention toward the code changes. Group-stratified evaluation exposes approximately 17% performance drops compared to random splits, while temporal splits on aggregated datasets prove unreliable due to compositional shift in the underlying project distributions. At a false positive rate of 0.5%, all fine-tuned code-only models miss over 93% of vulnerabilities. Larger and more diverse training data or generative approaches show preliminary improvements but do not resolve the underlying limitations. To support future research on code-centric VFC detection, we release our unified framework and evaluation suite.
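The group-stratified evaluation the abstract describes keeps all commits from one project on the same side of the train/test split, so the test set measures generalization to unseen projects rather than memorization of project-specific patterns. A minimal sketch with scikit-learn's `GroupShuffleSplit` on hypothetical commit/project labels (the identifiers below are illustrative, not from the paper's datasets):

```python
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical commit records and the project each one belongs to.
commits = [f"commit_{i}" for i in range(12)]
projects = ["openssl", "openssl", "curl", "curl", "linux", "linux",
            "linux", "git", "git", "nginx", "nginx", "redis"]

# Split by group (project), not by individual commit: ~1/3 of the
# *projects* land in the test set, with all of their commits.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=42)
train_idx, test_idx = next(splitter.split(commits, groups=projects))

train_projects = {projects[i] for i in train_idx}
test_projects = {projects[i] for i in test_idx}
print("train projects:", sorted(train_projects))
print("test projects: ", sorted(test_projects))
```

Compared with a random per-commit split, this removes project-level leakage, which is one plausible reason the paper observes an approximately 17% performance drop under group stratification.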