ArXiv TLDR

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

arXiv:2306.05685

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu + 8 more

cs.CL, cs.AI

TLDR

This paper demonstrates that strong large language models like GPT-4 can effectively serve as judges to evaluate other LLM-based chat assistants, closely matching human preferences on open-ended tasks.

Key contributions

  • Identifies and addresses biases and limitations in using LLMs as judges, such as position and verbosity biases (a minimal position-swap sketch follows this list).
  • Introduces two new benchmarks—MT-bench and Chatbot Arena—to validate LLM judge evaluations against human preferences.
  • Shows GPT-4 judges achieve over 80% agreement with humans, making LLM-as-a-judge a scalable and explainable evaluation method.
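To make the position-bias mitigation concrete, here is a minimal sketch of pairwise LLM-as-a-judge with position swapping: each pair of answers is judged twice with the order reversed, and a win is only recorded when both orderings agree. The OpenAI Python client, the "gpt-4" model name, the prompt wording, the judge_once/judge_pair helpers, and the tie rule are illustrative assumptions, not the paper's exact MT-bench judge template.

```python
# Minimal pairwise LLM-as-a-judge sketch with position-swap debiasing.
# Assumptions (not from the paper): OpenAI Python client (v1+), the "gpt-4"
# model name, and a simplified prompt and tie rule for illustration only.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = (
    "You are an impartial judge. Given a user question and two assistant "
    "answers, decide which answer is better overall. "
    "Reply with exactly 'A', 'B', or 'tie'.\n\n"
    "Question: {question}\n\nAnswer A: {answer_a}\n\nAnswer B: {answer_b}"
)


def judge_once(question: str, answer_a: str, answer_b: str) -> str:
    """Return a single verdict ('A', 'B', or 'tie') from the judge model."""
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # deterministic judging
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, answer_a=answer_a, answer_b=answer_b
            ),
        }],
    )
    return resp.choices[0].message.content.strip()


def judge_pair(question: str, ans_1: str, ans_2: str) -> str:
    """Judge twice with the answer order swapped to counter position bias.

    A model wins only if it is preferred in both orderings; otherwise the
    comparison is recorded as a tie.
    """
    first = judge_once(question, ans_1, ans_2)   # ans_1 shown as answer A
    second = judge_once(question, ans_2, ans_1)  # ans_1 shown as answer B
    if first == "A" and second == "B":
        return "model_1"
    if first == "B" and second == "A":
        return "model_2"
    return "tie"
```

Verdicts from a loop over many questions can then be aggregated into per-model win rates and compared against human votes, which is the kind of agreement measurement the paper reports.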

Why it matters

Evaluating conversational AI is difficult due to the complexity of human preferences and the cost of human annotation. This work provides a practical and scalable approach by leveraging strong LLMs as judges, validated through novel benchmarks and extensive human comparisons. This enables more efficient and reliable assessment of chat assistants, accelerating progress in developing better AI systems.

Original Abstract

Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. The MT-bench questions, 3K expert votes, and 30K conversations with human preferences are publicly available at https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge.
