ArXiv TLDR

Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling

arXiv: 2605.13801

Deepak Pandita, Flip Korn, Chris Welty, Christopher M. Homan

cs.LG, cs.AI

TLDR

This paper introduces a multi-level bootstrapping method to improve the reproducibility of AI evaluations by realistically modeling annotator behavior and analyzing the tradeoff between the number of items and the number of responses per item.

Key contributions

  • Introduces a multi-level bootstrapping approach to realistically model annotator behavior (a rough sketch follows this list).
  • Addresses the AI reproducibility crisis stemming from unreliable human evaluations.
  • Leverages datasets with numerous ratings and persistent rater identifiers.
  • Analyzes tradeoffs between items (N) and responses per item (K) for statistical significance.
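
The paper itself does not include code, but the basic shape of a multi-level (item-then-response) bootstrap can be sketched briefly. The sketch below assumes a long-format ratings table with `item_id`, `rater_id`, and `rating` columns; the column names, the mean-rating statistic, and all default parameters are illustrative assumptions, not details from the paper, and a faithful implementation of the authors' method would also exploit the persistent rater identifiers rather than just resampling responses within each item.

```python
import numpy as np
import pandas as pd

def multilevel_bootstrap(ratings: pd.DataFrame, n_items: int, k_responses: int,
                         n_boot: int = 1000, seed: int = 0) -> np.ndarray:
    """Two-level bootstrap over a long-format ratings table.

    Level 1 resamples n_items items with replacement; level 2 resamples
    k_responses ratings within each sampled item. Returns the bootstrap
    distribution of the overall mean rating.
    """
    rng = np.random.default_rng(seed)
    item_ids = ratings["item_id"].unique()
    # Pre-group ratings by item so the inner resampling loop is cheap.
    pools = {i: g["rating"].to_numpy() for i, g in ratings.groupby("item_id")}

    stats = np.empty(n_boot)
    for b in range(n_boot):
        sampled_items = rng.choice(item_ids, size=n_items, replace=True)   # level 1: items
        item_means = [
            rng.choice(pools[i], size=k_responses, replace=True).mean()   # level 2: responses
            for i in sampled_items
        ]
        stats[b] = np.mean(item_means)
    return stats
```

Running this for several (n_items, k_responses) settings and comparing the widths of the resulting bootstrap intervals is one simple way to probe the N-versus-K tradeoff mentioned above.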

Why it matters

AI evaluations suffer from a reproducibility crisis driven in part by variance among human raters. This paper offers a method for modeling annotator behavior that yields insight into how many items and how many responses per item are needed for reliable, repeatable experimental results, a prerequisite for building trustworthy generative AI systems.

Original Abstract

As generative AI models such as large language models (LLMs) become more pervasive, ensuring the safety, robustness, and overall trustworthiness of these systems is paramount. However, AI is currently facing a reproducibility crisis driven by unreliable evaluations and unrepeatable experimental results. While human raters are often used to assess models for utility and safety, they introduce divergent biases and subjective opinions into their annotations. Overcoming this variance is exceptionally challenging because very little data exists to study how experimental repeatability actually improves as the annotator pool grows. Standard evaluation practices typically rely on a small number of annotations per item (often 3 to 5) and lack the persistent rater identifiers necessary to model individual variance across items. In this work, we introduce a multi-level bootstrapping approach to realistically model annotator behavior. Leveraging datasets with a large number of ratings and persistent rater identifiers, we analyze the tradeoffs between the number of items ($N$) and the number of responses per item ($K$) required to achieve statistical significance.
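
To make the N-versus-K question concrete, here is a hypothetical simulation (not from the paper) that estimates how often a fixed quality gap between two systems reaches significance for different item counts N and responses per item K. The synthetic rating distributions, the t-test on per-item means, and every numeric threshold below are assumptions chosen purely for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

def significance_rate(n_items, k, effect=0.3, item_sd=0.5, noise_sd=1.0,
                      trials=500, alpha=0.05, seed=0):
    """Fraction of simulated experiments in which a fixed gap `effect`
    between two systems is significant, given N items and K ratings each."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        item_fx = rng.normal(0.0, item_sd, n_items)              # shared item difficulty
        a = item_fx[:, None] + rng.normal(0.0, noise_sd, (n_items, k))
        b = item_fx[:, None] + effect + rng.normal(0.0, noise_sd, (n_items, k))
        _, p = ttest_ind(a.mean(axis=1), b.mean(axis=1))          # test on per-item means
        hits += p < alpha
    return hits / trials

for n in (50, 100, 200):
    print(n, [round(significance_rate(n, k), 2) for k in (3, 5, 10)])
```

Sweeping a grid like this shows how adding items (N) and adding responses per item (K) trade off against each other when the goal is a statistically significant comparison, which is the question the paper studies with real rating data rather than synthetic draws.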
