Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings
Inês Oliveira e Silva, Sérgio Jesus, Iker Perez, Rita P. Ribeiro, Carlos Soares, and two others
TLDR
This paper shows that standard XAI evaluation metrics do not align with human perception or decision utility, and that explanations can raise decision confidence without improving performance, a signature of automation bias.
Key contributions
- Evaluated eight Shapley variants in operational risk workflows using a unified amortized framework (see the sketch after this list).
- Revealed that standard quantitative XAI metrics, such as sparsity and faithfulness, do not align with human-perceived clarity or decision utility.
- Found that explanations increased human decision confidence but not objective analyst performance.
- Highlighted a critical risk of automation bias in high-stakes XAI applications.
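The authors' unified amortized framework is not reproduced here, but for readers unfamiliar with Shapley-style attribution, a minimal sketch using the open-source `shap` library may help. TreeSHAP, shown below, is one concrete formulation of the general family the paper compares; the toy model, data, and parameters are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch of Shapley-style feature attribution with the `shap` library.
# The model and data are hypothetical stand-ins, not the paper's experiments.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                  # toy "risk" features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic binary label

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# TreeExplainer implements TreeSHAP, one Shapley formulation of the kind the
# paper evaluates; other variants differ in how the coalition value function
# is defined and approximated.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])  # per-case, per-feature attributions
```

Each attribution distributes a case's model output across its input features; the eight variants the paper studies differ chiefly in how that underlying value function is defined and estimated.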
Why it matters
This paper critically re-evaluates XAI practice by showing that standard metrics fail to predict human utility and that explanations themselves can foster automation bias. It provides evidence that current evaluation proxies are insufficient for high-stakes settings and offers guidance for selecting XAI formulations and metrics in operational decision systems.
Original Abstract
Shapley values are a cornerstone of explainable AI, yet their proliferation into competing formulations has created a fragmented landscape with little consensus on practical deployment. While theoretical differences are well-documented, evaluation remains reliant on quantitative proxies whose alignment with human utility is unverified. In this work, we use a unified amortized framework to isolate semantic differences between eight Shapley variants under the low-latency constraints of operational risk workflows. We conduct a large-scale empirical evaluation across four risk datasets and a realistic fraud-detection environment involving professional analysts and 3,735 case reviews. Our results reveal a fundamental misalignment: standard quantitative metrics, such as sparsity and faithfulness, are decoupled from human-perceived clarity and decision utility. Furthermore, while no formulation improved objective analyst performance, explanations consistently increased decision confidence, signaling a critical risk of automation bias in high-stakes settings. These findings suggest that current evaluation proxies are insufficient for predicting downstream human impact, and we provide evidence-based guidance for selecting formulations and metrics in operational decision systems.
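For context on the proxy metrics the abstract critiques, a deletion-style faithfulness check is sketched below. This is a hedged illustration of the general metric family, with assumed names and signatures, not the paper's exact protocol.

```python
# Illustrative deletion-based faithfulness check (a common metric family,
# not the paper's exact protocol; all names here are hypothetical).
import numpy as np

def deletion_faithfulness(predict_fn, x, attributions, baseline, k=3):
    """Drop in model score after masking the k highest-attributed features.

    predict_fn: maps an (n, d) array to an (n,) array of scores,
                e.g. lambda X: model.predict_proba(X)[:, 1].
    """
    x = np.asarray(x, dtype=float)
    masked = x.copy()
    top_k = np.argsort(-np.abs(attributions))[:k]          # top-k feature indices
    masked[top_k] = np.asarray(baseline, dtype=float)[top_k]  # mask with baseline
    return float(predict_fn(x[None, :])[0] - predict_fn(masked[None, :])[0])
```

A larger score drop is usually read as higher faithfulness; the paper's central finding is that such proxy scores were decoupled from analysts' perceived clarity and decision utility.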