ArXiv TLDR

SieveFL: Hierarchical Runtime-Aware Pruning for Scalable LLM-Based Fault Localization

arXiv: 2605.13491

Mahdi Farzandway, Fatemeh Ghassemi

cs.SE

TLDR

SieveFL is a hierarchical framework that uses aggressive pre-LLM filtering and runtime-aware pruning to enable scalable, accurate fault localization with openly available LLMs on commodity hardware.

Key contributions

  • Introduces SieveFL, a 5-stage hierarchical framework for scalable LLM-based fault localization.
  • Aggressively filters candidates using vector retrieval and JaCoCo runtime traces before LLM processing.
  • Achieves 41.8% Top-1 accuracy on Defects4J, outperforming AgentFL by 2.1 pp.
  • Reduces candidate methods by 79% and LLM token consumption by 49% while improving ranking.
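The reduction figures in the last bullet follow the standard before/after arithmetic; a tiny sketch with toy counts (the numbers below are illustrative, not the paper's data):

```python
def reduction(before: int, after: int) -> float:
    """Fractional reduction from `before` to `after` items."""
    return 1 - after / before

# e.g. 1000 candidate methods pruned down to 210 survivors:
print(f"{reduction(1000, 210):.0%} fewer candidate methods")  # 79%
```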

Why it matters

LLMs struggle with fault localization at scale due to high token costs and signal dilution. SieveFL addresses this by aggressively filtering candidates, making LLM-based fault localization practical and efficient on commodity hardware. This approach shows that powerful fault localization is achievable without proprietary frontier models.

Original Abstract

Automated fault localization requires connecting an observed test failure to the responsible method across thousands of candidates--a task that purely statistical approaches handle with limited precision and that LLMs cannot yet handle at full project scale due to prohibitive token cost and signal dilution. We present SieveFL, a five-stage hierarchical framework that resolves this tension through aggressive pre-LLM filtering. SieveFL converts a failing test into a natural-language failure description, uses dense vector retrieval to narrow the search to a small set of suspicious files, and then eliminates any method not executed during the failing test via JaCoCo runtime traces. Only the surviving candidates are passed to the LLM, which screens each method individually and re-ranks the confirmed suspects in a single comparative pass. We evaluate SieveFL on 395 bugs from Defects4J v1.2.0 using a mid-sized, openly available MoE model deployed on a commodity workstation (32 GB RAM, 8 GB GPU) via Ollama--no frontier APIs or datacenter hardware required. Treating 12 incomplete runs as failures, SieveFL achieves Top-1 accuracy of 41.8% (165/395 bugs) and an MRR of 0.469, outperforming the strongest prior agent-based baseline (AgentFL) by 2.1 pp in Top-1. Runtime pruning removes 79% of candidate methods and reduces input token consumption by 49%, while simultaneously improving ranking quality: Top-1 is preserved exactly and Top-3 through Top-10 improve by up to 2.4 pp. These results demonstrate that, with the right filtering architecture, capable fault localization does not require proprietary frontier models.
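The five-stage flow in the abstract can be sketched in a few lines. Everything below is an illustrative stand-in, not the paper's implementation: token overlap substitutes for dense vector retrieval, a plain set substitutes for JaCoCo coverage traces, and injected predicates substitute for the MoE LLM's screening and re-ranking calls.

```python
def stage1_describe(failing_test: str) -> set:
    # Stand-in for the natural-language failure description.
    return set(failing_test.lower().split())

def stage2_retrieve(desc, files, k=2):
    # Stand-in for dense retrieval: rank files by token overlap with the description.
    ranked = sorted(files, key=lambda f: len(desc & files[f]["tokens"]), reverse=True)
    return ranked[:k]

def stage3_prune(methods, executed):
    # Runtime-aware pruning: keep only methods the failing test actually executed.
    return [m for m in methods if m in executed]

def stage4_screen(methods, is_suspect):
    # The LLM screens each surviving method individually (predicate injected here).
    return [m for m in methods if is_suspect(m)]

def stage5_rerank(suspects, score):
    # A single comparative pass re-ranks the confirmed suspects.
    return sorted(suspects, key=score, reverse=True)

# Toy project: two files, each with keyword "tokens" and its methods.
files = {
    "Calc.java": {"tokens": {"divide", "zero", "arithmetic"},
                  "methods": ["Calc.divide", "Calc.multiply"]},
    "Log.java":  {"tokens": {"log", "format"},
                  "methods": ["Log.write"]},
}

desc = stage1_describe("ArithmeticException: divide by zero")
candidate_files = stage2_retrieve(desc, files, k=1)
candidates = [m for f in candidate_files for m in files[f]["methods"]]
survivors = stage3_prune(candidates, executed={"Calc.divide"})
suspects = stage4_screen(survivors, is_suspect=lambda m: True)
ranking = stage5_rerank(suspects, score=lambda m: 1.0)
print(ranking)  # ['Calc.divide']
```

The key architectural point survives even in this toy form: stages 2 and 3 shrink the candidate set cheaply, so the expensive per-method LLM screening in stage 4 only ever sees covered, retrieval-relevant methods.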
