ArXiv TLDR

Domain Fine-Tuning vs. Retrieval-Augmented Generation for Medical Multiple-Choice Question Answering: A Controlled Comparison at the 4B-Parameter Scale

arXiv: 2604.23801

Avi-ad Avraam Buskila

cs.CL, cs.IR

TLDR

This paper compares domain fine-tuning with retrieval-augmented generation for medical multiple-choice question answering using 4B-parameter LLMs, finding that fine-tuning yields significant accuracy gains while RAG does not.

Key contributions

  • Compared domain fine-tuning vs. RAG for medical MCQA using 4B LLMs (Gemma 3 4B vs. MedGemma 4B) in a controlled 2×2 design; a sketch of the evaluation loop follows this list.
  • Found domain fine-tuning improved majority-vote accuracy by +6.8 percentage points on MedQA-USMLE (53.3% vs. 46.4%).
  • RAG did not yield statistically significant gains for either the general or the fine-tuned model.
  • Concludes that domain knowledge encoded in weights dominates knowledge supplied in context at this scale.
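
For concreteness, here is a minimal sketch of the 2×2 evaluation loop (base vs. domain-tuned model, with vs. without retrieval) as it might look against a local Ollama server. The model tags, the retrieve() helper, the prompt template, and the temperature value are illustrative assumptions, not the authors' released code:

```python
# Sketch of the paper's 2x2 design: {general, domain-tuned} x {no RAG, RAG}.
import ollama
from collections import Counter

MODELS = {"general": "gemma3:4b", "domain": "medgemma:4b"}  # hypothetical Ollama tags

def retrieve(question: str, k: int = 3) -> list[str]:
    """Placeholder for the fixed retrieval pipeline over MedMCQA explanations."""
    raise NotImplementedError

def ask(model: str, question: str, options: dict[str, str], use_rag: bool) -> str:
    """One LLM call; returns the predicted option letter."""
    context = ""
    if use_rag:
        context = "Context:\n" + "\n".join(retrieve(question)) + "\n\n"
    prompt = (
        f"{context}Question: {question}\n"
        + "\n".join(f"{letter}. {text}" for letter, text in options.items())
        + "\nAnswer with a single letter (A-D)."
    )
    resp = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        options={"temperature": 0.7},  # exact value not given here; held fixed across all four cells
    )
    return resp["message"]["content"].strip()[0]  # crude letter extraction

def majority_vote(model: str, item: dict, use_rag: bool, reps: int = 3) -> str:
    """Three repetitions per question, majority vote, as in the paper's protocol."""
    votes = [ask(model, item["question"], item["options"], use_rag) for _ in range(reps)]
    return Counter(votes).most_common(1)[0][0]
```

Running all four cells over the 1,273-question test split at 3 repetitions each accounts for the 15,276 LLM calls reported in the abstract.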

Why it matters

This paper offers practical guidance for deploying small LLMs in specialized domains such as medicine. It shows that for 4B models on medical QA, domain fine-tuning is the more effective way to inject knowledge, while inference-time retrieval adds little. That helps practitioners decide where to spend a limited budget: on adapting model weights or on building a retrieval pipeline.

Original Abstract

Practitioners deploying small open-weight large language models (LLMs) for medical question answering face a recurring design choice: invest in a domain-fine-tuned model, or keep a general-purpose model and inject domain knowledge at inference time via retrieval-augmented generation (RAG). We isolate this trade-off by holding model size, prompt template, decoding temperature, retrieval pipeline, and evaluation protocol fixed, and varying only (i) whether the model has been domain-adapted (Gemma 3 4B vs. MedGemma 4B, both 4-bit quantized and served via Ollama) and (ii) whether retrieved passages from a medical knowledge corpus are inserted into the prompt. We evaluate all four cells of this 2x2 design on the full MedQA-USMLE 4-option test split (1,273 questions) with three repetitions per question (15,276 LLM calls). Domain fine-tuning yields a +6.8 percentage-point gain in majority-vote accuracy over the general 4B baseline (53.3% vs. 46.4%, McNemar p < 10^-4). RAG over MedMCQA explanations does not produce a statistically significant gain in either model, and in the domain-tuned model the point estimate is slightly negative (-1.9 pp, p = 0.16). At this scale and on this benchmark, domain knowledge encoded in weights dominates domain knowledge supplied in context. We release the full experiment code and JSONL traces to support replication.
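
The abstract's significance claims rest on paired McNemar tests over per-question correctness. A minimal sketch of that test with statsmodels follows (the mcnemar function and its signature are real; the boolean correctness arrays and the helper name are assumptions):

```python
# McNemar's exact test over paired per-question majority-vote correctness,
# the statistic behind the +6.8 pp claim (p < 10^-4).
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def paired_pvalue(correct_a: np.ndarray, correct_b: np.ndarray) -> float:
    """Exact McNemar p-value for two systems scored on the same questions."""
    both = np.sum(correct_a & correct_b)
    only_a = np.sum(correct_a & ~correct_b)
    only_b = np.sum(~correct_a & correct_b)
    neither = np.sum(~correct_a & ~correct_b)
    table = [[both, only_a], [only_b, neither]]
    result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
    return result.pvalue

# e.g. paired_pvalue(medgemma_correct, gemma_correct) over the 1,273 test questions,
# where each array holds one True/False entry per question.
```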
