Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings
Inês Oliveira e Silva, Sérgio Jesus, Iker Perez, Rita P. Ribeiro, Carlos Soares, and two others
TLDR
This paper shows that standard XAI evaluation metrics do not align with human perception or decision utility, and that explanations can raise decision confidence without improving performance, a signature of automation bias.
Key contributions
- Evaluated eight Shapley variants in operational risk workflows using a unified amortized framework (see the sketch after this list).
- Revealed that standard quantitative XAI metrics, such as sparsity and faithfulness, do not align with human-perceived clarity or decision utility.
- Found that explanations increased human decision confidence but not objective analyst performance.
- Highlighted a critical risk of automation bias in high-stakes XAI applications.
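The authors' unified amortized framework is not reproduced here, but for readers unfamiliar with Shapley-style attribution, a minimal sketch using the open-source `shap` library may help. TreeSHAP, shown below, is one concrete formulation of the general family the paper compares; the toy model, data, and parameters are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch of Shapley-style feature attribution with the `shap` library.
# The model and data are hypothetical stand-ins, not the paper's experiments.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                  # toy "risk" features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic binary label

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# TreeExplainer implements TreeSHAP, one Shapley formulation of the kind the
# paper evaluates; other variants differ in how the coalition value function
# is defined and approximated.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])  # per-case, per-feature attributions
```

Each attribution distributes a case's model output across its input features; the eight variants the paper studies differ chiefly in how that underlying value function is defined and estimated.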
Why it matters
This paper critically re-evaluates XAI practice by showing that standard metrics fail to predict human utility and that explanations themselves can foster automation bias. It provides evidence that current evaluation proxies are insufficient for high-stakes settings and offers guidance for selecting XAI formulations and metrics in operational decision systems.
Original Abstract
Shapley values are a cornerstone of explainable AI, yet their proliferation into competing formulations has created a fragmented landscape with little consensus on practical deployment. While theoretical differences are well-documented, evaluation remains reliant on quantitative proxies whose alignment with human utility is unverified. In this work, we use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows. We conduct a large-scale empirical evaluation across four risk datasets and a realistic fraud-detection environment involving professional analysts and 3,735 case reviews. Our results reveal a fundamental misalignment: standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility. Furthermore, while no formulation improved objective analyst performance, explanations consistently increased decision confidence, signaling a critical risk of automation bias in high-stakes settings. These findings suggest that current evaluation proxies are insufficient for predicting downstream human impact, and we provide evidence-based guidance for selecting formulations and metrics in operational decision systems.
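For context on the proxy metrics the abstract critiques, a deletion-style faithfulness check is sketched below. This is a hedged illustration of the general metric family, with assumed names and signatures, not the paper's exact protocol.

```python
# Illustrative deletion-based faithfulness check (a common metric family,
# not the paper's exact protocol; all names here are hypothetical).
import numpy as np

def deletion_faithfulness(predict_fn, x, attributions, baseline, k=3):
    """Drop in model score after masking the k highest-attributed features.

    predict_fn: maps an (n, d) array to an (n,) array of scores,
                e.g. lambda X: model.predict_proba(X)[:, 1].
    """
    x = np.asarray(x, dtype=float)
    masked = x.copy()
    top_k = np.argsort(-np.abs(attributions))[:k]          # top-k feature indices
    masked[top_k] = np.asarray(baseline, dtype=float)[top_k]  # mask with baseline
    return float(predict_fn(x[None, :])[0] - predict_fn(masked[None, :])[0])
```

A larger score drop is usually read as higher faithfulness; the paper's central finding is that such proxy scores were decoupled from analysts' perceived clarity and decision utility.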