ArXiv TLDR

Evaluating Software Defect Prediction Models via the Area Under the ROC Curve Can Be Misleading

2604.20742

Luigi Lavazza, Gabriele Rotoloni, Sandro Morasca

cs.SE

TLDR

High AUC values in Software Defect Prediction can be misleading: they don't guarantee better-than-random True Positive and False Positive Rates at every classification threshold.

Key contributions

  • Traditional AUC evaluation for Software Defect Prediction (SDP) models can lead to incorrect conclusions.
  • A high AUC does not guarantee that a model outperforms a random classifier on both the True Positive Rate and the False Positive Rate at every threshold.
  • Decorated ROC curves, which highlight the points corresponding to specific threshold values, are needed for accurate SDP model evaluation.
  • Alternative representations are required to fully appreciate all relevant aspects of SDP models.

Why it matters

This paper challenges the common reliance on AUC for evaluating software defect prediction models. It highlights that a high AUC doesn't always reflect genuinely superior performance across all operational thresholds, potentially leading to misinterpretations. This work is crucial for improving the reliability and interpretability of SDP model evaluations.

Original Abstract

Background: Receiver Operating Characteristic (ROC) curves are widely used to evaluate the performance of Software Defect Prediction (SDP) models that estimate module fault-proneness, i.e., the probability that a module is faulty. A ROC curve maps a model's performance in terms of True Positive Rate and False Positive Rate for any possible threshold set on fault-proneness. The Area Under the ROC Curve (AUC) summarizes the performance of a model across all possible thresholds. Traditionally, ROC curves completely above the bisector of the ROC space are considered better than random, and high AUC values are associated with good performance.

Aim: We investigate whether these beliefs are correct, hence if SDP model evaluation based on ROC curves and AUC is reliable.

Method: We decorate ROC curves by highlighting the points corresponding to threshold values. We also represent True Positive Rate and False Positive Rate as functions of the threshold. Thus, we can evaluate whether a model classifies both faulty and non-faulty modules better than the random model.

Results: We show that commonly used evaluation criteria may lead to wrong conclusions.

Conclusions: A high value of AUC does not guarantee that both the True Positive Rate and the False Positive Rate of a model are better than the random model's for all possible thresholds. Either decorated ROC curves or alternative representations are needed to appreciate all the relevant aspects of SDP models.
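To see how this can happen, here is a minimal numeric sketch of the idea (not code from the paper; the scores and dataset are made up for illustration). A toy model separates faulty from non-faulty modules perfectly, so its AUC is 1.0, yet because all its fault-proneness scores are compressed below 0.4, at threshold 0.5 its True Positive Rate is 0, worse than a uniform random classifier, whose TPR at threshold t is 1 - t. Tracing the ROC curve while keeping each point "decorated" with the threshold that produced it makes this visible:

```python
import numpy as np

# Illustrative fault-proneness scores in [0, 1] for a toy SDP model
# (assumption: these values are invented, not taken from the paper).
scores_nonfaulty = np.array([0.00, 0.05, 0.10, 0.15])
scores_faulty    = np.array([0.20, 0.25, 0.30, 0.35])

def tpr_fpr(threshold):
    """Classify a module as faulty when its score >= threshold."""
    tpr = np.mean(scores_faulty >= threshold)
    fpr = np.mean(scores_nonfaulty >= threshold)
    return tpr, fpr

# Sweep thresholds to trace the ROC curve; each point stays paired
# ("decorated") with the threshold value that produced it.
thresholds = np.linspace(0.0, 1.0, 101)
decorated = [(t, *tpr_fpr(t)) for t in thresholds]

# AUC via the trapezoidal rule. TPR and FPR are both non-increasing in
# the threshold, so sorting each ascending keeps the pairs aligned.
tprs = sorted(tpr for _, tpr, _ in decorated)
fprs = sorted(fpr for _, _, fpr in decorated)
auc = sum((fprs[i + 1] - fprs[i]) * (tprs[i + 1] + tprs[i]) / 2
          for i in range(len(fprs) - 1))

# A random classifier that predicts "faulty" with probability 1 - t has
# TPR = FPR = 1 - t at threshold t. At t = 0.5 the toy model's TPR is 0
# (no faulty module scores that high), yet its AUC is perfect.
t = 0.5
tpr, fpr = tpr_fpr(t)
print(f"AUC = {auc:.2f}; at threshold {t}: model TPR = {tpr:.2f}, "
      f"random TPR = {1 - t:.2f}")
```

The point is exactly the paper's: the AUC summary (here 1.0) says nothing about performance at a particular operational threshold, whereas the decorated points expose that the model is unusable at thresholds above its score range.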
