ArXiv TLDR

Agentic Repository Mining: A Multi-Task Evaluation

2605.04845

Johannes Härtel

cs.SE

TLDR

LLM agents exploring repositories via bash commands achieve competitive classification accuracy and superior robustness compared to simple LLMs.

Key contributions

  • LLM agents classify repository artifacts by dynamically exploring with bash commands.
  • Agents achieve competitive accuracy against simple LLMs across diverse classification tasks.
  • Primary benefit: enhanced robustness — agents avoid context-window overflows and scale independently of artifact size.
  • Because agents access broader context, accuracy measured against ground-truth labels produced under limited context may underestimate them.
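The paper's agents retrieve their own context by issuing standard bash commands instead of receiving pre-engineered input. A minimal sketch of such a loop is below; `ask_llm` is a stand-in for a real model call, and the `LABEL:` stop convention is an illustrative assumption, not the paper's protocol:

```python
import subprocess

def classify_artifact(repo_path, question, ask_llm, max_steps=5):
    """Minimal agent loop: the LLM either requests a bash command to run
    in the repository or emits a final label. `ask_llm` is a placeholder
    for an actual model call."""
    transcript = [f"Task: {question} (repository at {repo_path})"]
    for _ in range(max_steps):
        action = ask_llm("\n".join(transcript))
        if action.startswith("LABEL:"):
            return action.removeprefix("LABEL:").strip()
        # Treat anything else as a bash command; run it in the repo.
        result = subprocess.run(action, shell=True, cwd=repo_path,
                                capture_output=True, text=True, timeout=30)
        # Cap command output so the transcript cannot overflow the context window.
        transcript.append(f"$ {action}\n{result.stdout[:2000]}")
    return "UNKNOWN"
```

Capping each command's output is what gives the robustness the paper highlights: the transcript grows with the number of steps, not with the size of the artifact being classified.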

Why it matters

This paper addresses the challenge of accurate and scalable software repository mining. By demonstrating that LLM agents which dynamically retrieve their own context can match simple LLMs given pre-engineered context, it offers a robust alternative to such pipelines. The approach improves classification robustness and scalability, especially for large or complex artifacts.

Original Abstract

Mining software repositories often requires classifying artifacts like commits, reviews, code lines, or entire repositories into categories. Human labeling is expensive and error-prone; limited context frequently leads to misclassifications or uncertainty in labels. We investigate whether LLM agents that dynamically explore repositories through standard bash commands can match the classification quality of simple LLMs that receive pre-engineered context. Across four tasks, eight approach configurations, and 4943 classifications, agents achieve competitive accuracy despite retrieving their own context. The primary advantage is robustness: agents avoid context-window overflows and scale independently of artifact size. A manual diagnosis of 100 cases where approaches disagree with the ground truth reveals specification ambiguities and labels produced under limited context, suggesting that accuracy against such ground truth may underestimate approaches with broader context access.

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.