ArXiv TLDR

Heimdallr: Characterizing and Detecting LLM-Induced Security Risks in GitHub CI Workflows

🐦 Tweet
2605.05969

Bonan Ruan, Yeqi Fu, Chuqi Zhang, Jiahao Liu, Jun Zeng + 1 more

cs.CRcs.SE

TLDR

Heimdallr characterizes and detects LLM-induced security risks in GitHub CI workflows, revealing a new attack surface and disclosing hundreds of vulnerabilities.

Key contributions

  • First study to characterize LLM-induced security risks in GitHub CI workflows.
  • Develops a taxonomy of high-level risk classes and concrete threat vectors.
  • Introduces Heimdallr, a hybrid analysis framework for detecting these risks.
  • Heimdallr achieved high accuracy (F1=0.917) and identified 802 vulnerable instances.

Why it matters

LLMs are increasingly integrated into CI, creating a novel and underexplored attack surface. This paper fills a critical research gap by systematically characterizing these risks. Heimdallr provides a practical solution, leading to the disclosure of hundreds of real-world vulnerabilities, significantly enhancing CI security.

Original Abstract

GitHub Continuous Integration (CI) workflows increasingly integrate Large Language Models (LLMs) to automate review, triage, content generation, and repository maintenance. This creates a new attack surface: externally controllable workflow inputs can shape LLM prompts and outputs, which may in turn affect security decisions, repository state, or privileged execution. Although LLM security and CI security have each been studied extensively, their intersection remains underexplored. In this paper, we present the first study of LLM-induced security risks in GitHub CI workflows. We characterize the problem along the full execution chain and develop a taxonomy of high-level risk classes and concrete threat vectors. To detect such risks in practice, we design Heimdallr, a hybrid analysis framework that normalizes workflows into an LLM-Workflow Property Graph (L-WPG) and combines triggerability analysis, LLM-assisted dataflow summarization, and deterministic propagation to synthesize concrete threat-vector findings. Evaluated on 300 manually annotated unique workflows, Heimdallr achieves high accuracy on LLM-node identification (F1~=~0.994), triggerability classification (99.8%), and threat-vector detection (micro-average F1~=~0.917). As part of an ongoing detection and disclosure effort, we have so far responsibly disclosed 802 vulnerable workflow instances across 759 repositories and received 71 acknowledgments.

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.