ArXiv TLDR

Stop Using the Wilcoxon Test: Myth, Misconception and Misuse in IR Research

2604.25349

Julián Urbano

cs.IR, stat.AP, stat.ME

TLDR

The Wilcoxon signed-rank test is routinely misapplied in Information Retrieval evaluation, where it easily loses control of its Type I error rate; the author argues it should be abandoned.

Key contributions

  • Reveals how the Wilcoxon test is routinely misapplied in IR research due to misconceptions.
  • Highlights inconsistencies in statistics textbooks regarding assumptions of statistical tests.
  • Empirically demonstrates how Wilcoxon loses control of its Type I error rate in IR settings.
  • Argues that abandoning the Wilcoxon test would improve methodological soundness in IR.

Why it matters

This paper re-examines the widespread use of the Wilcoxon test in IR and shows that, as typically applied there, it can lose control of its Type I error rate and mislead researchers. The author argues that abandoning it would improve the methodological soundness and reliability of IR evaluation.

Original Abstract

In benchmarking of Information Retrieval systems, the Wilcoxon signed-rank test is often treated as a safer alternative to the t-test. This belief is fueled by textbooks and recommendations that portray Wilcoxon as the proper non-parametric alternative because metric scores are not normally distributed. We argue that this narrative is misleading and harmful. A careful review of Statistics textbooks reveals inconsistencies and omissions in how the assumptions underlying these tests are presented, fostering confusion that has propagated into IR research. As a result, Wilcoxon has been routinely misapplied for decades, creating a false sense of safety against a threat that was never there to begin with, while introducing another one so severe that it virtually guarantees the test will break down and mislead researchers. Through a combination of systematic literature review, analysis and empirical demonstrations with TREC data, we show how and why the Wilcoxon test easily loses control of its Type I error rate in IR settings. We conclude that the continued use of Wilcoxon in IR evaluation is unjustified and that abandoning it would improve the methodological soundness of our field.
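
The loss of Type I error control described above can be illustrated with a small simulation. This is a minimal sketch, not the paper's experiment: it uses synthetic data rather than TREC data, and it assumes one well-known failure scenario, namely per-topic score differences that are skewed but have a true mean of exactly zero, tested with a signed-rank test as if it were a test of the mean. The names `wilcoxon_p_value`, `rejection_rate`, `skewed` and `symmetric` are hypothetical helpers, and the choices of 50 topics, 2000 trials and a shifted exponential distribution are illustrative assumptions.

```python
import math
import random


def wilcoxon_p_value(diffs):
    """Two-sided signed-rank p-value using the normal approximation.

    Assumes continuous data, so there are no ties and no zero differences.
    """
    n = len(diffs)
    # Rank absolute differences from smallest (rank 1) to largest (rank n).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    # W+ is the sum of the ranks that belong to positive differences.
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2))


def rejection_rate(draw_diff, n_topics=50, trials=2000, alpha=0.05, seed=0):
    """Fraction of simulated experiments in which the test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        diffs = [draw_diff(rng) for _ in range(n_topics)]
        if wilcoxon_p_value(diffs) < alpha:
            rejections += 1
    return rejections / trials


def skewed(rng):
    # Illustrative assumption: Exp(1) shifted to have mean zero, so the two
    # hypothetical systems are equally good on average, but differences are skewed.
    return rng.expovariate(1.0) - 1.0


def symmetric(rng):
    # Control case: symmetric differences with mean zero.
    return rng.gauss(0.0, 1.0)


skewed_rate = rejection_rate(skewed)
symmetric_rate = rejection_rate(symmetric)
print(f"skewed, mean-zero differences: rejection rate = {skewed_rate:.3f}")
print(f"symmetric, mean-zero differences: rejection rate = {symmetric_rate:.3f}")
```

Both scenarios have a true mean difference of zero, so a test of the mean that controls its Type I error should reject in roughly 5% of trials. Under these assumptions the symmetric case stays near that level, while the skewed case rejects far more often, because the signed-rank test responds to asymmetry in the differences and not only to a shift in their mean.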
