DEFault++: Automated Fault Detection, Categorization, and Diagnosis for Transformer Architectures
Sigma Jahan, Saurabh Singh Rajput, Tushar Sharma, Mohammad Masudur Rahman
TLDR
DEFault++ automatically detects, categorizes, and diagnoses faults in transformer architectures, improving repair accuracy for critical AI applications.
Key contributions
- Introduces DEFault++, a hierarchical system for detecting, categorizing (12 types), and diagnosing (45 causes) transformer faults.
- Develops DEFault-bench, a benchmark of 3,739 labeled fault instances generated via systematic mutation testing.
- Achieves >0.96 AUROC for fault detection and >0.85 Macro-F1 for categorization and root-cause diagnosis.
- Increases developer repair accuracy from 57.1% to 83.3% in a study, demonstrating practical utility.
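The benchmark above is built by mutation testing: deliberately injecting a known, labeled fault into a transformer and recording the resulting behavior. A minimal sketch of what one such mutation might look like is below. It is a hypothetical illustration, not the paper's DEForm implementation: a toy multi-head attention in which one head is silenced to inject a labeled "attention-head" fault.

```python
import numpy as np

# Hypothetical mutation operator (illustrative only, not DEForm itself):
# zero out one attention head in a toy multi-head attention layer.
def multi_head_attention(q, k, v, n_heads, dead_head=None):
    """Toy multi-head attention; `dead_head` silences one head (the mutation)."""
    d = q.shape[-1] // n_heads
    outs = []
    for h in range(n_heads):
        qh, kh, vh = (x[:, h * d:(h + 1) * d] for x in (q, k, v))
        scores = qh @ kh.T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out = weights @ vh
        if h == dead_head:  # injected fault: this head contributes nothing
            out = np.zeros_like(out)
        outs.append(out)
    return np.concatenate(outs, axis=-1)

rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(4, 8))  # 4 tokens, 2 heads of dim 4
healthy = multi_head_attention(q, k, v, n_heads=2)
mutated = multi_head_attention(q, k, v, n_heads=2, dead_head=0)
```

Note how the injected fault is silent in exactly the sense the paper targets: the mutated layer raises no error and returns an output of the correct shape, yet its values differ from the healthy run.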
Why it matters
Transformer models underpin critical AI applications, yet faults in their internals often degrade behavior silently and evade generic DNN debugging tools. DEFault++ provides a specialized, accurate, and interpretable way to pinpoint which transformer component failed and why, significantly boosting developer efficiency when repairing these systems.
Original Abstract
Transformer models are widely deployed in critical AI applications, yet faults in their attention mechanisms, projections, and other internal components often degrade behavior silently without raising runtime errors. Existing fault diagnosis techniques often target generic deep neural networks and cannot identify which transformer component is responsible for an observed symptom. In this article, we present DEFault++, a hierarchical learning-based diagnostic technique that operates at three levels of abstraction: it detects whether a fault is present, classifies it into one of 12 transformer-specific fault categories (covering both attention-internal mechanisms and surrounding architectural components), and identifies the underlying root cause from up to 45 mechanisms. To facilitate both training and evaluation, we construct DEFault-bench, a benchmark of 3,739 labeled instances obtained through systematic mutation testing. These instances are created across seven transformer models and nine downstream tasks using DEForm, a transformer-specific mutation technique we developed for this purpose. DEFault++ measures runtime behavior at the level of individual transformer components. It organizes these measurements through a Fault Propagation Graph (FPG) derived from the transformer architecture. It then produces an interpretable diagnosis using prototype matching combined with supervised contrastive learning. On DEFault-bench, DEFault++ exceeds an AUROC of 0.96 for detection and a Macro-F1 of 0.85 for both categorization and root-cause diagnosis on encoder and decoder architectures. In a developer study with 21 practitioners, the accuracy of choosing correct repair actions increased from 57.1% without support to 83.3% when using DEFault++.
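The abstract describes diagnosis via prototype matching over embeddings learned with supervised contrastive learning. The sketch below shows the prototype-matching step only, under assumed shapes and names (the embedding model, prototype vectors, and category labels are all hypothetical, not taken from the paper): an embedded runtime measurement is assigned to the fault category whose prototype is nearest by cosine similarity.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two 1-D vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def diagnose(embedding, prototypes):
    """Return (best_category, similarity) for the closest fault prototype.

    embedding  : 1-D embedded feature vector for the observed runtime behavior
    prototypes : dict mapping fault-category name -> prototype vector
    (hypothetical interface; the paper's actual embeddings come from
    supervised contrastive training over component-level measurements)
    """
    scores = {cat: cosine(embedding, p) for cat, p in prototypes.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy usage with made-up prototypes for two illustrative fault categories.
protos = {
    "attention_mask_fault": np.array([1.0, 0.0, 0.2]),
    "projection_fault": np.array([0.1, 1.0, 0.0]),
}
obs = np.array([0.9, 0.1, 0.15])
cat, sim = diagnose(obs, protos)
```

Matching against class prototypes rather than an opaque classifier head is what makes the output interpretable: the reported similarity to each prototype explains why a given fault category was chosen.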