Correct Code, Vulnerable Dependencies: A Large Scale Measurement Study of LLM-Specified Library Versions
Chengjie Wang, Jingzheng Wu, Xiang Ling, Tianyue Luo, Chen Zhao
TLDR
LLMs frequently specify vulnerable and incompatible third-party library versions, a systemic issue that external constraints can mitigate.
Key contributions
- LLMs often specify vulnerable library versions: 36-55% of tasks pin at least one version with a known CVE, and 62-74% of those are rated Critical or High severity (see the sketch after this list).
- Most of the associated CVEs (72-91%) were publicly disclosed before the models' knowledge cutoffs, and all models converge on the same small set of risky releases, indicating a systemic bias rather than isolated model error.
- LLM-selected versions cause frequent compatibility failures: static compatibility (successful installation) is only 19-63%, and dynamic test pass rates fall to 6-48%, with installation failure as the dominant cause.
- Externally anchoring version constraints significantly reduces both security vulnerabilities and compatibility issues.
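The vulnerability numbers above reduce to a simple check: resolve each pinned version against a public vulnerability database. Below is a minimal sketch of such a check using the OSV.dev query API; the package name and version are illustrative placeholders, and the paper's actual measurement pipeline is not reproduced here.

```python
# Minimal sketch: does an exact PyPI pin have known advisories?
# Queries the public OSV.dev API; package/version below are placeholders.
import json
import urllib.request

def known_vulns(package: str, version: str) -> list[dict]:
    """Return OSV advisories affecting an exact PyPI package version."""
    payload = json.dumps({
        "version": version,
        "package": {"name": package, "ecosystem": "PyPI"},
    }).encode("utf-8")
    req = urllib.request.Request(
        "https://api.osv.dev/v1/query",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp).get("vulns", [])

if __name__ == "__main__":
    # An old pin of the kind an LLM might emit.
    for vuln in known_vulns("requests", "2.25.0"):
        # CVE numbers typically appear under "aliases" alongside GHSA/OSV ids.
        print(vuln["id"], vuln.get("aliases", []))
```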
Why it matters
This paper uncovers a critical, previously overlooked security and compatibility risk in LLM-generated code: the selection of third-party library versions. It demonstrates a systemic bias in LLMs towards vulnerable and incompatible versions, even when the relevant CVEs were publicly disclosed before the models' knowledge cutoffs. The findings highlight the need for external version management to improve the reliability and security of LLM-assisted development.
Original Abstract
Large language models (LLMs) are now widely used in software development workflows, and the code they generate routinely includes third-party library (TPL) imports annotated with specific version identifiers. These version choices can carry security and compatibility risks, yet they have not been systematically studied. We present the first large-scale measurement study of version-level risk in LLM-generated Python code, evaluating 10 LLMs on PinTrace, a curated benchmark of 1,000 Stack Overflow programming tasks. LLMs tend to specify version identifiers when directly prompted (26.83%-95.18% of tasks), but this drops to 6.45%-59.19% when asked to create a manifest file directly. Among the specified versions, 36.70%-55.70% of tasks contain at least one known CVE, and 62.75%-74.51% of them carry Critical or High severity ratings. In 72.27%-91.37% of cases, the associated CVEs were publicly disclosed before the model's knowledge cutoff. The statistics show all models converge on the same small set of risky release versions, indicating a systemic bias rather than isolated model error. Static compatibility rates range from 19.70% to 63.20%, with installation failure as the dominant cause. Dynamic test cases confirm this pattern, with pass rates of 6.49%-48.62%. Further experiments confirm that these failures are attributable to version selection rather than code quality, and that externally anchored version constraints substantially reduce both vulnerability exposure and compatibility failures. Our findings reveal LLM version selection as a first-class, previously overlooked risk surface in LLM-based development. We disclosed these findings to the developers of the evaluated models, and several confirmed the issue. All the code and dataset have been released for open science at https://github.com/dw763j/PinTrace.
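The mitigation highlighted in the abstract, externally anchored version constraints, can be approximated with ordinary packaging tooling. The sketch below assumes an organization-maintained pin list; the file contents and versions are illustrative rather than the paper's tooling, and pip's own constraints files (`pip install -r requirements.txt -c constraints.txt`) provide an equivalent mechanism.

```python
# Minimal sketch: override LLM-chosen versions with externally vetted pins
# before installation. Pin values are illustrative, not a recommendation.

# Vetted versions maintained outside the LLM loop (e.g., by a security team).
APPROVED_PINS = {
    "requests": "2.32.3",
    "numpy": "1.26.4",
}

def anchor(requirements: list[str]) -> list[str]:
    """Replace whatever version an LLM specified with the approved pin."""
    anchored = []
    for line in requirements:
        # Strip any version specifier to recover the bare package name.
        name = line.split("==")[0].split(">=")[0].split("<=")[0].strip()
        pin = APPROVED_PINS.get(name.lower())
        anchored.append(f"{name}=={pin}" if pin else line)
    return anchored

if __name__ == "__main__":
    llm_requirements = ["requests==2.25.0", "numpy>=1.19"]  # LLM-chosen specifiers
    print("\n".join(anchor(llm_requirements)))
    # requests==2.32.3
    # numpy==1.26.4
```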