Evaluating Small Open LLMs for Medical Question Answering: A Practical Framework
TLDR
This paper introduces an open-source framework to evaluate small open LLMs for medical QA, highlighting low reproducibility as a critical safety concern.
Key contributions
- Presents an open-source framework for evaluating small, local LLMs in medical question answering.
- Scores models on eight quality metrics (e.g., BERTScore, ROUGE-L, an LLM-as-judge rubric) and two within-model reproducibility metrics.
- Reveals low self-agreement (at most 0.20) and high output uniqueness (87-97%) across repeated runs of the same question (see the sketch after this list).
- Shows the clinically fine-tuned MedGemma 1.5 4B underperforming the larger general-purpose models on both quality and reproducibility, though this comparison confounds domain fine-tuning with model scale.
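The summary does not spell out how the two reproducibility metrics are defined, so the following is a minimal Python sketch under assumed definitions: self-agreement as the mean pairwise exact-match rate after light normalization, and uniqueness as the fraction of distinct outputs among the N repeated runs. Both function names and the normalization are hypothetical stand-ins, not the paper's implementation.

```python
from itertools import combinations

def self_agreement(responses: list[str]) -> float:
    # Mean pairwise exact-match rate across repeated runs.
    # NOTE: hypothetical definition; the paper's actual agreement
    # metric may use a softer similarity than exact match.
    norm = [" ".join(r.lower().split()) for r in responses]
    pairs = list(combinations(norm, 2))
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 1.0

def output_uniqueness(responses: list[str]) -> float:
    # Fraction of distinct (normalized) outputs among the N runs.
    norm = {" ".join(r.lower().split()) for r in responses}
    return len(norm) / len(responses)

# Toy example: 10 runs of one question, 9 distinct answers.
runs = ["Drink plenty of fluids."] * 2 + [f"Variant answer {i}" for i in range(8)]
print(round(self_agreement(runs), 3))  # 0.022 (1 matching pair out of 45)
print(output_uniqueness(runs))         # 0.9
```

Under these definitions, the paper's reported numbers (self-agreement at most 0.20, 87-97% unique outputs) would mean that most pairs of repeated answers differ, which is the safety gap the framework is built to surface.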
Why it matters
Reliability is crucial for LLMs in medical settings, especially in online health communities such as Reddit, where misinformation spreads easily. By measuring how much repeated outputs diverge, the framework exposes a safety gap in current small LLMs that single-pass benchmarks miss, and gives practitioners tools to assess reproducibility before deployment.
Original Abstract
Incorporating large language models (LLMs) in medical question answering demands more than high average accuracy: a model that returns substantively different answers each time it is queried is not a reliable medical tool. Online health communities such as Reddit have become a primary source of medical information for millions of users, yet they remain highly susceptible to misinformation; deploying LLMs as assistants in these settings amplifies the need for output consistency alongside correctness. We present a practical, open-source evaluation framework for assessing small, locally-deployable open-weight LLMs on medical question answering, treating reproducibility as a first-class metric alongside lexical and semantic accuracy. Our pipeline computes eight quality metrics, including BERTScore, ROUGE-L, and an LLM-as-judge rubric, together with two within-model reproducibility metrics derived from repeated inference (N=10 runs per question). Evaluating three models (Llama 3.1 8B, Gemma 3 12B, MedGemma 1.5 4B) on 50 MedQuAD questions (N=1,500 total responses) reveals that despite low-temperature generation (T=0.2), self-agreement across runs reaches at most 0.20, while 87-97% of all outputs per model are unique -- a safety gap that single-pass benchmarks entirely miss. The clinically fine-tuned MedGemma 1.5 4B underperforms the larger general-purpose models on both quality and reproducibility; however, because MedGemma is also the smallest model, this comparison confounds domain fine-tuning with model scale. We describe the methodology in sufficient detail for practitioners to replicate or extend the evaluation for their own model-selection workflows. All code and data pipelines are available at https://github.com/aviad-buskila/llm_medical_reproducibility.
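To make the pipeline concrete, here is a minimal sketch of the repeated-inference loop described in the abstract: N=10 generations per question at T=0.2, scored for quality (ROUGE-L shown here, via the `rouge-score` package) and for reproducibility. The `generate()` hook is hypothetical; the authors' actual harness lives in the linked repository and may differ in metrics and normalization.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

N_RUNS = 10        # repeated inference per question, as in the paper
TEMPERATURE = 0.2  # low-temperature decoding, as in the paper

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def generate(model: str, question: str, temperature: float) -> str:
    """Hypothetical hook for a local-inference client (e.g., an
    Ollama or llama.cpp wrapper); swap in whatever serves your models."""
    raise NotImplementedError

def evaluate_question(model: str, question: str, reference: str) -> dict:
    # Run the same question N times at the same low temperature.
    runs = [generate(model, question, TEMPERATURE) for _ in range(N_RUNS)]
    # Quality: ROUGE-L F1 against the MedQuAD reference answer.
    rouge_l = [scorer.score(reference, r)["rougeL"].fmeasure for r in runs]
    # Reproducibility: fraction of distinct outputs after normalization.
    uniqueness = len({" ".join(r.lower().split()) for r in runs}) / N_RUNS
    return {
        "mean_rouge_l": sum(rouge_l) / N_RUNS,
        "uniqueness": uniqueness,
    }
```

A full replication would add the remaining quality metrics (BERTScore, the LLM-as-judge rubric, and so on) and aggregate over the 50 MedQuAD questions per model.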