Reliability of AI Bots Footprints in GitHub Actions CI/CD Workflows

April 20, 2026 · 2604.18334

Syed Muhammad Ashhar Shah, Sehrish Habib, Muizz Hussain, Maryam Abdul Ghafoor, Abdul Ali Bangash

cs.SE

TLDR

This paper analyzes AI bot reliability in GitHub Actions CI/CD, finding agent-dependent success rates and a negative correlation with contribution frequency.

Key contributions

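Analyzed 61,837 CI/CD workflow runs from 2,355 repositories, all triggered by PRs from five AI bots: Claude, Devin, Cursor, Copilot, and Codex.
Copilot (~93%) and Codex (~94%) showed the highest CI/CD success rates among the five agents.
At the repository level, higher AI agent PR frequency correlates negatively with CI/CD workflow success rate.
Defined a taxonomy of 13 categories over 3,067 agentic PRs with failed workflows, noting visible but not statistically significant shifts from functional to non-functional categories over time.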
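
As a rough illustration of how a per-agent success rate like those above could be computed, here is a minimal sketch. It assumes each workflow run is a record with an `agent` label and a `conclusion` string (the GitHub Actions API reports a `conclusion` field on workflow runs), and it counts a run as successful only when the conclusion is `"success"`. The summary does not state the paper's exact definition, and the function name, record layout, and toy data are hypothetical.

```python
from collections import defaultdict


def success_rate_by_agent(runs):
    """Return {agent: fraction of that agent's runs concluding in 'success'}.

    Assumption: every other conclusion (failure, cancelled, ...) counts
    as not successful. The paper may define success differently.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for run in runs:
        totals[run["agent"]] += 1
        if run["conclusion"] == "success":
            passed[run["agent"]] += 1
    return {agent: passed[agent] / totals[agent] for agent in totals}


# Toy records invented for illustration; not data from the paper.
runs = [
    {"agent": "agent_a", "conclusion": "success"},
    {"agent": "agent_a", "conclusion": "failure"},
    {"agent": "agent_b", "conclusion": "success"},
    {"agent": "agent_b", "conclusion": "success"},
]
rates = success_rate_by_agent(runs)
```

Whether cancelled or skipped runs belong in the denominator is a design choice that can shift the reported rate, so it is worth stating explicitly in any replication.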

Why it matters

This study provides empirical data on how reliably PRs from AI bots pass GitHub Actions CI/CD workflows. Reliability varies by agent and tends to be lower in repositories with more frequent agentic PRs. The findings underscore the need for actionable guidance on integrating AI agents into CI/CD workflows and for safeguards where failures are most likely to occur.

Original Abstract

Continuous Integration and Deployment (CI/CD) workflows are central to modern software delivery, yet the reliability of agentic AI bots operating within these workflows remains underexplored. Using pull requests (PRs), commits, and repositories from the AIDev dataset, we retrieved associated CI/CD workflow runs via the GitHub Actions API and analyzed 61,837 runs from 2,355 repositories, all triggered by PRs generated by five AI bots: Claude, Devin, Cursor, Copilot, and Codex. We observed substantial agent-dependent differences in workflow reliability, with Copilot and Codex achieving the highest success rates, ~93% and ~94% respectively. At the repository level, we find a negative correlation between AI agent contribution frequency and workflow success rate, suggesting that a higher frequency of agentic PRs may hinder CI/CD workflow reliability. We defined a taxonomy of 13 categories against 3,067 agentic PRs whose associated workflows failed, and a trend analysis indicates visually observable shifts from functional to non-functional PR categories over time, although these trends are not statistically significant. Our findings motivate the need for actionable guidance on integrating AI agents into CI/CD workflows and prioritizing safeguards in workflows where failures are most likely to occur.
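
The repository-level correlation described in the abstract can be sketched as a rank correlation between each repository's agentic PR count and its workflow success rate. The abstract does not say which coefficient the authors used, so the choice of Spearman's rho here is an assumption, and the helper names and toy numbers are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy per-repository values invented for illustration; not data from the paper.
agentic_pr_counts = [2, 5, 9, 14, 30]
success_rates = [0.98, 0.95, 0.96, 0.90, 0.85]
rho = spearman(agentic_pr_counts, success_rates)
```

A negative `rho` on real data would match the direction of the reported finding, though, as the abstract's wording ("may hinder") suggests, a correlation alone does not establish that frequent agentic PRs cause the failures.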

