Vulnerability Detection with Interprocedural Context in Multiple Languages: Assessing Effectiveness and Cost of Modern LLMs
Kevin Lira, Baldoino Fonseca, Davy Baía, Márcio Ribeiro, Wesley K. G. Assunção
TLDR
This study assesses four modern LLMs for detecting interprocedural vulnerabilities across C, C++, and Python, and identifies cost-effective model and context configurations.
Key contributions
- Empirically investigated four modern LLMs for detecting interprocedural vulnerabilities across C, C++, and Python.
- Varied interprocedural context levels (function-only, +callers, +callees) on 509 vulnerabilities.
- Gemini 3 Flash showed best cost-effectiveness for C vulnerabilities (F1 >= 0.978) at low inference cost.
- Claude Haiku 4.5 correctly identified and explained 93.6% of evaluated vulnerabilities.
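The three context levels above can be pictured as progressively richer prompts shown to the model. The sketch below is illustrative only, not the authors' actual pipeline; the function name `build_prompt`, the section markers, and the prompt wording are assumptions.

```python
# Illustrative sketch (not the study's code) of assembling the three
# interprocedural context levels: function-only, +callers, +callees.

def build_prompt(target_src, callers=None, callees=None):
    """Assemble the code context shown to the LLM for one detection query."""
    parts = ["# Target function\n" + target_src]
    if callers:  # "+callers" configuration
        parts.append("# Callers\n" + "\n\n".join(callers))
    if callees:  # "+callees" configuration
        parts.append("# Callees\n" + "\n\n".join(callees))
    parts.append("Is the target function vulnerable? Answer yes/no and explain.")
    return "\n\n".join(parts)

target = "def read_file(path):\n    return open(path).read()"
caller = "def handle(req):\n    return read_file(req.args['f'])"

p1 = build_prompt(target)                    # function-only
p2 = build_prompt(target, callers=[caller])  # target + callers
```

In the study's setup, each of these configurations is sent to each of the four models, and detection quality and token cost are compared across levels.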
Why it matters
Prior studies of LLM-based vulnerability detection largely overlooked interprocedural dependencies. This research fills that gap by evaluating modern LLMs with varying levels of caller/callee context across multiple languages. Its findings inform the design of more effective and generalizable AI-assisted security analysis tools.
Original Abstract
Large Language Models (LLMs) have been a promising way for automated vulnerability detection. However, most prior studies have explored the use of LLMs to detect vulnerabilities only within single functions, disregarding those related to interprocedural dependencies. These studies overlook vulnerabilities that arise from data and control flows that span multiple functions. Thus, leveraging the context provided by callers and callees may help identify vulnerabilities. This study empirically investigates the effectiveness of detection, the inference cost, and the quality of explanations of four modern LLMs (Claude Haiku 4.5, GPT-4.1 Mini, GPT-5 Mini, and Gemini 3 Flash) in detecting vulnerabilities related to interprocedural dependencies. To do that, we conducted an empirical study on 509 vulnerabilities from the ReposVul dataset, systematically varying the level of interprocedural context (target function code-only, target function + callers, and target function + callees) and evaluating the four modern LLMs across C, C++, and Python. The results show that Gemini 3 Flash offers the best cost-effectiveness trade-off for C vulnerabilities, achieving F1 >= 0.978 at an estimated cost of $0.50-$0.58 per configuration, and Claude Haiku 4.5 correctly identified and explained the vulnerability in 93.6% of the evaluated cases. Overall, the findings have direct implications for the design of AI-assisted security analysis tools that can generalize across codebases in multiple programming languages.
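To make the abstract's point concrete, here is a hypothetical (not from the paper's dataset) example of a vulnerability that spans a data flow across two functions: the callee alone looks harmless, and only the caller reveals that its argument is attacker-controlled.

```python
# Hypothetical interprocedural path-traversal example. Viewed in
# isolation, read_entry looks benign; the caller shows that `name`
# comes from user input and is never validated against BASE.
import os

BASE = "/srv/data"

def read_entry(name):
    # Missing check that the joined path stays under BASE.
    path = os.path.join(BASE, name)
    return path  # a real implementation would open(path).read()

def handle_request(user_supplied):
    # Caller: user_supplied arrives directly from a request parameter.
    return read_entry(user_supplied)

# A name like "../../etc/passwd" joins to a path that escapes BASE,
# which a function-only analysis of read_entry cannot see.
```

This is the kind of case where the "+callers" configuration supplies the evidence (tainted input) that the function-only configuration lacks.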