Safety and accuracy follow different scaling laws in clinical large language models
Sebastian Wind, Tri-Thien Nguyen, Jeta Sopa, Mahshad Lotfinia, Sebastian Bickelhaup + 7 more
TLDR
Clinical LLM safety is not a passive consequence of scaling but a deployment property shaped by evidence quality, retrieval, and context.
Key contributions
- Introduces SaFE-Scale, a framework to measure clinical LLM safety across model scale, evidence quality, and retrieval.
- Develops RadSaFE-200, a 200-question radiology safety benchmark with clinician-defined high-risk error, unsafe answer, and evidence contradiction labels.
- Finds that clean evidence dramatically improves safety and accuracy, raising mean accuracy from 73.5% to 94.1% while reducing high-risk errors from 12.0% to 2.6%.
- Shows that standard and agentic RAG do not reproduce the clean-evidence safety profile, with high-risk errors and dangerous overconfidence remaining elevated.
Why it matters
This research challenges the assumption that larger clinical LLMs are inherently safer, showing instead that safety is a deployment property. By providing a framework and benchmark for evaluating medical AI safety, it shifts the emphasis from raw scaling to evidence quality and retrieval design, a prerequisite for trustworthy healthcare AI.
Original Abstract
Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.
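To make the reported metrics concrete, here is a minimal sketch of how option-level safety labels of the kind described in the abstract could be aggregated into dataset-level rates (accuracy, high-risk error, evidence contradiction, dangerous overconfidence). The record fields, the confidence threshold, and the exact operationalization of "dangerous overconfidence" are illustrative assumptions, not the authors' released evaluation code.

```python
from dataclasses import dataclass

# Hypothetical record for one model answer on one RadSaFE-200-style question.
# Field names and the confidence threshold are illustrative assumptions,
# not taken from the paper's evaluation pipeline.
@dataclass
class Answer:
    correct: bool               # chosen option matches the keyed answer
    high_risk: bool             # chosen option carries a clinician-defined high-risk label
    contradicts_evidence: bool  # chosen option contradicts the provided evidence
    confidence: float           # model's self-reported confidence in [0, 1]

def safety_metrics(answers: list[Answer], conf_threshold: float = 0.9) -> dict[str, float]:
    """Aggregate per-question labels into dataset-level rates (percentages)."""
    n = len(answers)
    accuracy = sum(a.correct for a in answers) / n
    high_risk_error = sum(a.high_risk and not a.correct for a in answers) / n
    contradiction = sum(a.contradicts_evidence for a in answers) / n
    # One plausible reading of "dangerous overconfidence": a high-risk wrong
    # answer given with high self-reported confidence.
    dangerous_overconfidence = sum(
        a.high_risk and not a.correct and a.confidence >= conf_threshold
        for a in answers
    ) / n
    return {
        "accuracy": 100 * accuracy,
        "high_risk_error": 100 * high_risk_error,
        "contradiction": 100 * contradiction,
        "dangerous_overconfidence": 100 * dangerous_overconfidence,
    }

# Example: the same metrics would be computed separately for each deployment
# condition (closed-book, clean evidence, conflict evidence, standard RAG,
# agentic RAG, max-context) and then compared across conditions.
closed_book = [
    Answer(correct=False, high_risk=True, contradicts_evidence=True, confidence=0.95),
    Answer(correct=True, high_risk=False, contradicts_evidence=False, confidence=0.80),
]
print(safety_metrics(closed_book))
```

Comparing these per-condition rates, rather than accuracy alone, is what lets the study separate "more accurate" from "safer" deployments.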