ArXiv TLDR

Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

arXiv: 2604.19281

Abu Noman Md Sakib, Md. Main Oddin Chisty, Zijie Zhang

cs.HC cs.AI cs.CL cs.LG

TLDR

This paper introduces VB-Score, a new component-wise evaluation framework for medical QA LLMs, revealing significant gaps between semantic and entity accuracy as well as health equity risks.

Key contributions

  • Introduces VB-Score, a component-wise framework evaluating medical QA LLMs beyond semantic similarity.
  • Reveals significant discrepancies between LLM semantic and entity accuracy in medical contexts.
  • Identifies alarming performance disparities, indicating condition-based algorithmic discrimination in LLMs.
  • Demonstrates prompt engineering alone cannot compensate for architectural limits in medical entity extraction.

Why it matters

This paper matters because it exposes severe limitations in how LLMs are currently evaluated for medical QA: semantic-similarity metrics often miss critical accuracy failures and health equity risks. The authors report algorithmic discrimination that disproportionately affects chronic conditions common in older and minority populations, and they argue for re-evaluating medical AI safety using a more robust, component-wise assessment framework.

Original Abstract

The use of Large Language Models (LLMs) to support patients in addressing medical questions is becoming increasingly prevalent. However, most of the measures currently used to evaluate the performance of these models in this context only measure how closely a model's answers match semantically, and therefore do not provide a true indication of the model's medical accuracy or of the health equity risks associated with it. To address these shortcomings, we present a new evaluation framework for medical question answering called VB-Score (Verification-Based Score) that provides a separate evaluation of the four components of entity recognition, semantic similarity, factual consistency, and structured information completeness for medical question-answering models. We perform rigorous reviews of the performance of three well-known and widely used LLMs on 48 public health-related topics taken from high-quality, authoritative information sources. Based on our analyses, we discover a major discrepancy between the models' semantic and entity accuracy. Our assessments of the performance of all three models show that each of them has almost uniformly severe performance failures when evaluated against our criteria. Our findings indicate alarming performance disparities across various public health topics, with most of the models exhibiting 13.8% lower performance (compared to an overall average) for all the public health topics that relate to chronic conditions that occur in older and minority populations, which indicates the existence of what's known as condition-based algorithmic discrimination. Our findings also demonstrate that prompt engineering alone does not compensate for basic architectural limitations on how these models perform in extracting medical entities and raise the question of whether semantic evaluation alone is a sufficient measure of medical AI safety.
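The abstract describes VB-Score as scoring four components separately: entity recognition, semantic similarity, factual consistency, and structured information completeness. As a rough illustration of what a component-wise aggregate might look like, here is a minimal sketch; the component names follow the abstract, but the equal weighting, the `vb_score` function, and all numbers are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of component-wise aggregation in the spirit of VB-Score.
# Weights and scoring values are assumptions for illustration only.

COMPONENTS = [
    "entity_recognition",
    "semantic_similarity",
    "factual_consistency",
    "structured_completeness",
]

def vb_score(scores, weights=None):
    """Combine per-component scores (each in [0, 1]) into one value."""
    if weights is None:
        # Assumed equal weighting across the four components.
        weights = {name: 1.0 / len(COMPONENTS) for name in COMPONENTS}
    for name in COMPONENTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1]")
    return sum(weights[name] * scores[name] for name in COMPONENTS)

# Example: an answer with high semantic similarity but weak entity
# recognition -- the kind of gap the paper says single-metric
# evaluation hides.
example = {
    "entity_recognition": 0.42,
    "semantic_similarity": 0.91,
    "factual_consistency": 0.70,
    "structured_completeness": 0.55,
}
print(round(vb_score(example), 3))  # 0.645
```

The point of reporting components separately (rather than only the aggregate) is visible in the example: a semantic-similarity-only metric would rate this answer 0.91, while its entity recognition sits at 0.42.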
