NeuroFlake: A Neuro-Symbolic LLM Framework for Flaky Test Classification
Khondaker Tasnia Hoque, Toukir Ahammed
TLDR
NeuroFlake is a neuro-symbolic LLM framework that uses discriminative token mining to accurately classify flaky tests, improving performance and robustness.
Key contributions
- Introduces NeuroFlake, a neuro-symbolic LLM framework for classifying flaky tests on imbalanced datasets.
- Employs Discriminative Token Mining (DTM) to automatically find high-fidelity, statistically significant code tokens.
- Injects DTM-discovered tokens into LLM attention, fusing neural intuition with symbolic precision for better logic comprehension.
- Achieves 69.34% F1-score on FlakeBench, outperforming SOTA (65.79%), and shows superior robustness to adversarial perturbations.
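The digest does not specify which statistic DTM uses, but the idea of mining "statistically significant" tokens can be sketched with a standard chi-squared association score between token presence and the flaky/stable label. Everything below (function name, whitespace tokenization, the chi-squared choice) is an illustrative assumption, not the paper's actual implementation:

```python
from collections import Counter

def mine_discriminative_tokens(flaky_tests, stable_tests, top_k=5):
    """Rank tokens by a chi-squared statistic of their association with
    the flaky class (one plausible realization of DTM-style mining)."""
    # Document frequency of each token per class (set() avoids double-counting
    # repeated tokens within one test).
    flaky_counts = Counter(t for test in flaky_tests for t in set(test.split()))
    stable_counts = Counter(t for test in stable_tests for t in set(test.split()))
    n_f, n_s = len(flaky_tests), len(stable_tests)

    scores = {}
    for token in set(flaky_counts) | set(stable_counts):
        a = flaky_counts[token]      # flaky tests containing the token
        b = stable_counts[token]     # stable tests containing the token
        c, d = n_f - a, n_s - b      # tests in each class without it
        n = a + b + c + d
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        if denom == 0:
            continue
        # Chi-squared for the 2x2 contingency table [[a, b], [c, d]].
        scores[token] = n * (a * d - b * c) ** 2 / denom
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

On a toy corpus, tokens like `sleep` or `await` that appear only in flaky tests get the highest scores, matching the abstract's examples of concurrency primitives and async waits surfacing as discriminative signals.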
Why it matters
Flaky tests severely impact software reliability. Existing LLMs struggle to grasp the logic underlying test flakiness, leading to poor generalization. NeuroFlake's neuro-symbolic approach provides a robust alternative, improving classification accuracy and stability.
Original Abstract
Flaky tests, which exhibit non-deterministic pass/fail behavior for the same version of code, pose significant challenges to reliable regression testing. While large language models (LLMs) show promise for automated flaky test classification, they often fail to comprehend the actual logic behind test flakiness, instead overfitting to superficial textual artifacts (e.g., specific variable names). This semantic fragility leads to poor generalization on real-world imbalanced datasets and vulnerability to perturbations. In this paper, we introduce NeuroFlake, a novel neuro-symbolic framework for classifying flaky tests on highly imbalanced, real-world datasets (FlakeBench). Unlike prior approaches that rely on brittle manual rules and black-box learning, NeuroFlake integrates a Discriminative Token Mining (DTM) module to automate the discovery of high-fidelity, statistically significant source code tokens (e.g., specific concurrency primitives or async waits). By injecting these strong latent signals directly into the LLM's attention mechanism, we bridge the gap between neural intuition and symbolic precision. Our experiments demonstrate that neuro-symbolic fusion significantly improves classification performance, raising the F1-score to 69.34%, whereas the best prior state-of-the-art result is 65.79%. We further evaluate NeuroFlake's robustness through adversarial stress testing, introducing semantics-preserving augmentations (e.g., dead code injection, variable renaming). While baseline models exhibit performance degradation of 8-18 percentage points (pp) on perturbed tests, NeuroFlake maintains performance stability on unseen augmentations, dropping only 4-7 pp.
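The abstract says mined tokens are injected "directly into the LLM's attention mechanism" but does not detail the fusion. One simple way such an injection could work is an additive logit bonus at positions holding mined tokens, so every query attends more strongly to them. The function below is a hedged single-head sketch of that idea, not the paper's actual mechanism:

```python
import numpy as np

def biased_attention(q, k, v, token_ids, mined_ids, bias=2.0):
    """Single-head scaled dot-product attention where sequence positions
    holding DTM-mined tokens receive an additive logit bonus
    (an illustrative injection scheme, assumed for this sketch)."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)            # (L, L) raw attention scores
    boost = np.isin(token_ids, mined_ids)    # True at mined-token positions
    logits = logits + bias * boost[None, :]  # bias every query toward them
    # Numerically stable softmax over keys.
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

With `bias=0` this reduces to ordinary attention, so the symbolic signal acts as a tunable prior rather than a hard rule, which is consistent with the paper's framing of fusing neural intuition with symbolic precision.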