Predicting Disagreement with Human Raters in LLM-as-a-Judge Difficulty Assessment without Using Generation-Time Probability Signals
TLDR
This paper proposes a method to predict when LLM-as-a-Judge difficulty ratings will disagree with human ratings, without relying on generation-time probability signals.
Key contributions
- Proposes a method to predict disagreement between LLM-as-a-Judge and human difficulty ratings.
- Avoids generation-time probability signals, which must be collected during rating generation and are hard to compare across LLMs — a limitation of prior approaches.
- Exploits the ordinal nature of difficulty ratings and a separate embedding space (e.g., ModernBERT), flagging disagreement candidates via the geometric consistency of the rating set.
- Achieves higher AUC than probability-based baselines in CEFR-based sentence difficulty assessment.
Why it matters
LLM-as-a-Judge is promising for assessing the difficulty of educational materials, but disagreement with human raters remains a challenge. By predicting which LLM ratings are likely to differ from human judgment, this paper lets only those cases be sent for re-rating, improving reliability while reducing manual effort.
Original Abstract
Automatic generation of educational materials using large language models (LLMs) is becoming increasingly common, but assigning difficulty levels to such materials still requires substantial human effort. LLM-as-a-Judge has therefore attracted attention, yet disagreement with human raters remains a major challenge. We propose a method for predicting which LLM-generated difficulty ratings are likely to disagree with human raters, so that such cases can be sent for re-rating. Unlike prior approaches, our method does not rely on generation-time probability signals, which must be collected during rating generation and are often difficult to compare across LLMs. Instead, exploiting the fact that difficulty is an ordinal scale, we use a separate embedding space, such as ModernBERT, and identify disagreement candidates based on the geometric consistency of the rating set. Experiments on English CEFR-based sentence difficulty assessment with GPT-OSS-120B and Qwen3-235B-A22B showed that the proposed method achieved higher AUC for predicting disagreement with human raters than probability-based baselines.
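The digest does not include implementation details, but the core idea — flagging ratings that are geometrically inconsistent with their neighborhood in an embedding space — can be sketched. The following is a minimal illustration, not the paper's actual algorithm: it uses toy 2-D vectors in place of ModernBERT embeddings and a simple k-nearest-neighbor check, where an item whose ordinal rating differs sharply from the ratings of its nearest embedding neighbors is scored as a disagreement candidate.

```python
import numpy as np

def consistency_scores(embeddings, ratings, k=5):
    """Score each item's rating by its ordinal gap to the ratings of its
    k nearest neighbors in embedding space (cosine similarity).
    Higher score -> stronger disagreement candidate for re-rating."""
    X = np.asarray(embeddings, dtype=float)
    r = np.asarray(ratings, dtype=float)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize
    sim = Xn @ Xn.T                                    # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)                     # exclude self-matches
    scores = np.empty(len(r))
    for i in range(len(r)):
        nbrs = np.argsort(sim[i])[-k:]                 # k most similar items
        scores[i] = np.mean(np.abs(r[nbrs] - r[i]))    # mean ordinal gap
    return scores

# Toy demo: placeholder 2-D "embeddings"; the third item's rating
# breaks the local ordering, so it should get the highest score.
emb = [[0.0, 1.0], [0.1, 1.0], [0.2, 1.0], [5.0, 1.0], [5.1, 1.0]]
ratings = [1, 1, 5, 4, 4]
scores = consistency_scores(emb, ratings, k=2)
print(scores.argmax())  # → 2 (the inconsistent item)
```

Ranking items by such a score and sweeping a threshold is what would yield the AUC-style evaluation against human agreement labels described in the abstract.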