Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring
Jamshid Mozafari, Bhawna Piryani, Adam Jatowt
TLDR
Q-DAPS estimates LLM question difficulty by analyzing the entropy of answer plausibility scores, outperforming baselines and aligning with human judgment.
Key contributions
- Introduces Q-DAPS, a novel method to estimate LLM question difficulty via entropy of answer plausibility scores.
- Q-DAPS consistently outperforms baselines across four major QA datasets (TriviaQA, NQ, MuSiQue, QASC).
- Demonstrates strong robustness across hyperparameter variations, model sizes, and plausibility estimation paradigms.
- Human evaluations confirm Q-DAPS's difficulty estimates strongly align with human judgments.
Why it matters
Accurately estimating question difficulty is crucial for evaluating and improving LLMs. Q-DAPS provides an interpretable, scalable, and bias-resilient method that better captures reasoning challenges for modern QA systems.
Original Abstract
Estimating question difficulty is a critical component in evaluating and improving large language models (LLMs) for question answering (QA). Existing approaches often rely on readability formulas, retrieval-based signals, or popularity statistics, which may not fully capture the reasoning challenges posed to modern LLMs. In this paper, we introduce Q-DAPS (Question Difficulty based on Answer Plausibility Scores), a novel method that estimates question difficulty by computing the entropy of plausibility scores over candidate answers. We systematically evaluate Q-DAPS across four prominent QA datasets (TriviaQA, NQ, MuSiQue, and QASC), demonstrating that it consistently outperforms baselines. Moreover, Q-DAPS shows strong robustness across hyperparameter variations and question types. Extensive ablation studies further show that Q-DAPS remains robust across different plausibility estimation paradigms, model sizes, and realistic settings. Human evaluations further confirm strong alignment between Q-DAPS's difficulty estimates and human judgments of question difficulty. Overall, Q-DAPS provides an interpretable, scalable, and bias-resilient approach to question difficulty estimation in modern QA systems.
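The core idea, entropy over candidate-answer plausibility scores, can be sketched in a few lines. The snippet below is a minimal illustration of that idea, not the paper's exact implementation: the scores, candidate lists, and the `question_difficulty` helper are all hypothetical, and how plausibility scores are obtained from an LLM is left abstract here.

```python
import math

def question_difficulty(plausibility_scores):
    """Illustrative sketch: normalize candidate-answer plausibility
    scores into a probability distribution and return its Shannon
    entropy. Intuition: if plausibility is spread across many
    candidates (high entropy), the model is uncertain and the
    question is estimated to be harder."""
    total = sum(plausibility_scores)
    probs = [s / total for s in plausibility_scores]
    # Shannon entropy in nats; skip zero-probability candidates.
    return -sum(p * math.log(p) for p in probs if p > 0)

# One dominant candidate -> low entropy -> estimated easy.
easy = question_difficulty([0.9, 0.05, 0.03, 0.02])
# Near-uniform plausibility -> high entropy -> estimated hard.
hard = question_difficulty([0.25, 0.25, 0.25, 0.25])
```

Here `easy` comes out well below `hard` (which equals ln 4, the maximum for four candidates), matching the intended ordering.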