ArXiv TLDR

Quality-Driven Selective Mutation for Deep Learning

arXiv:2604.22640

Zaheed Ahmed, Emmanuel Charleson Dapaah, Philip Makedonski, Jens Grabowski

cs.SE · cs.LG

TLDR

A probabilistic framework quantifies deep learning mutant quality, enabling selection that reduces generation costs while preserving resistance and realism.

Key contributions

  • Introduces a probabilistic framework to quantify deep learning mutant quality.
  • Measures mutant resistance using statistical killing probabilities.
  • Quantifies mutant realism via generalized Jaccard similarity to real faults.
  • Reduces mutant generation by up to 55.6% without compromising quality.
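The two quality axes can be made concrete. Resistance builds on statistical killing probabilities: because DL training is stochastic, a mutant is retrained several times, and the fraction of runs in which it is killed estimates its killing probability (hard-to-kill mutants have a low one). Realism uses the generalized Jaccard similarity, the sum of elementwise minima over the sum of elementwise maxima, between the detectability patterns of a mutant and a real fault. A minimal sketch of both quantities — the function names and the simple frequency estimator are illustrative assumptions, not the paper's exact implementation:

```python
def killing_probability(kill_outcomes):
    """Estimate a mutant's killing probability as the fraction of
    independent retrainings in which it was killed (1 = killed, 0 = not).
    Illustrative frequency estimate, not the paper's exact statistic."""
    return sum(kill_outcomes) / len(kill_outcomes)

def resistance(kill_outcomes):
    """Hard-to-kill mutants have low killing probability,
    so resistance is taken here as its complement."""
    return 1.0 - killing_probability(kill_outcomes)

def generalized_jaccard(x, y):
    """Generalized Jaccard similarity between two nonnegative
    detectability vectors: sum of elementwise minima divided by
    sum of elementwise maxima."""
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return num / den if den else 0.0
```

For example, a mutant killed in 1 of 4 retrainings gets resistance 0.75, and detectability vectors [1, 0, 1] and [1, 1, 0] have generalized Jaccard similarity 1/3.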

Why it matters

Mutation testing for deep learning must balance mutant quality (resistance and realism) against generation cost. This paper offers a unified framework for quantifying both properties and selecting high-quality mutants efficiently, which matters for improving DL testing and debugging without excessive overhead.

Original Abstract

Mutants support testing and debugging in two roles: (i) as test goals and (ii) as substitutes for real faults. Hard-to-kill mutants provide better guidance for test improvement, while realism is essential when mutants are used to simulate real bugs. Building on these roles, selective mutation for deep learning (DL) aims to reduce the cost of mutant generation and execution by choosing operator configurations that yield resistant and realistic mutants. However, the DL literature lacks a unified measure that captures both aspects. This study presents a probabilistic framework to quantify mutant quality along two complementary axes: resistance and realism. Resistance adapts the classical notion of hard-to-kill mutants to the DL setting using statistical killing probabilities, while realism is measured via the generalized Jaccard similarity between mutant and real-fault detectability patterns. The framework enables ranking and filtering of low-quality mutation-operator configurations without assuming a specific use case. We empirically evaluate the approach on four datasets of real DL faults. Three datasets (CleanML, DeepFD, and DeepLocalize) are used to estimate and select high-quality operator configurations, and the held-out defect4ML dataset is used for validation. Results show that quality-driven selection reduces the number of generated mutants by up to 55.6% while preserving typical levels of resistance and realism under baseline-aligned selection thresholds. These findings confirm that dual-objective selection can lower cost without compromising the usefulness of mutants for either role.
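The dual-objective selection the abstract describes amounts to filtering mutation-operator configurations whose estimated resistance or realism falls below a threshold, so low-quality mutants are never generated. A hedged sketch of that filtering step — the configuration names, score values, and fixed thresholds below are invented for illustration; the paper derives its thresholds from baseline-aligned levels estimated on CleanML, DeepFD, and DeepLocalize:

```python
def select_configurations(configs, min_resistance, min_realism):
    """Dual-objective selection: keep only operator configurations whose
    estimated mutants meet both quality thresholds, so the rest are
    skipped at generation time."""
    return [c for c in configs
            if c["resistance"] >= min_resistance and c["realism"] >= min_realism]

# Hypothetical per-configuration quality estimates:
configs = [
    {"name": "change_activation", "resistance": 0.8, "realism": 0.7},
    {"name": "remove_layer",      "resistance": 0.3, "realism": 0.6},
    {"name": "shuffle_weights",   "resistance": 0.9, "realism": 0.2},
]
kept = select_configurations(configs, min_resistance=0.5, min_realism=0.5)
# Only configurations passing BOTH thresholds survive, cutting the
# number of mutants that must be generated and executed.
```

Because both axes must pass, a configuration that is resistant but unrealistic (or vice versa) is filtered out, which is how the approach trims generation cost without sacrificing either role.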
