LiteMedCoT-VL: Parameter-Efficient Adaptation for Medical Visual Question Answering
Runze Ma, Shunbo Jia, Haonan Lyu, Guo Liu, Caizhi Liao
TLDR
LiteMedCoT-VL enables compact 2B models to achieve advanced medical VQA reasoning by distilling chain-of-thought from a 235B teacher.
Key contributions
- Introduces LiteMedCoT-VL, a pipeline for transferring chain-of-thought reasoning to compact medical VLMs.
- Uses LoRA-based fine-tuning on explanation-enriched data to distill reasoning from a 235B teacher to 2B student models.
- Achieves 64.9% accuracy on PMC-VQA, exceeding the zero-shot Qwen3-VL-4B baseline (53.9%) by 11.0 percentage points and outperforming all published baselines.
- Demonstrates that 2B models with reasoning distillation can match or exceed models with twice the parameters.
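The LoRA-based fine-tuning in the pipeline rests on a simple idea: the frozen pretrained weight matrix W is augmented with a trainable low-rank update scaled by alpha/r, so only the small factors A and B are learned during distillation. The toy sketch below illustrates that update in pure Python; it is not the paper's code, and the function names and shapes here are illustrative assumptions (a real pipeline would use a standard PEFT library on the student VLM's attention weights).

```python
# Toy illustration of the LoRA update W_eff = W + (alpha / r) * B @ A.
# W stays frozen; only the low-rank factors A (r x d_in) and B (d_out x r)
# would be trained. Names and shapes are hypothetical, for illustration only.

def matmul(X, Y):
    """Plain-Python matrix product of nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(W, A, B, x, alpha=16, r=8):
    """Apply the LoRA-adapted weight to input vector x."""
    scale = alpha / r
    BA = matmul(B, A)  # low-rank update, same shape as W
    W_eff = [[w + scale * d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(W, BA)]
    # y = W_eff @ x
    return [sum(w * xi for w, xi in zip(row, x)) for row in W_eff]
```

Because B is typically zero-initialized, the adapted model starts out identical to the frozen student and only diverges as the distillation loss updates A and B, which is what makes the adaptation parameter-efficient.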
Why it matters
This paper addresses the critical challenge of deploying advanced medical AI on resource-constrained devices. By enabling compact models to perform complex multi-step reasoning, it paves the way for more accessible and interpretable clinical decision support. The approach significantly closes the performance gap between large and small vision-language models in medical VQA.
Original Abstract
The reasoning gap between large and compact vision-language models (VLMs) limits the deployment of medical AI on portable clinical devices. Compact VLMs of 2--4B parameters can run on resource-constrained hardware but lack the multi-step reasoning capacity needed for interpretable clinical decision support. Existing knowledge distillation methods transfer answers without the reasoning process behind them. Medical visual question answering (VQA) serves as a testbed for this problem, as it requires models to integrate visual evidence with clinical knowledge through structured reasoning chains. We introduce LiteMedCoT-VL, a pipeline that transfers chain-of-thought reasoning from a 235B teacher model to 2B student models through LoRA-based fine-tuning on explanation-enriched training data. All inference is conducted without image captions by default, simulating the clinical scenario in which a physician interprets a medical image directly without an accompanying radiology report. On the PMC-VQA benchmark, LiteMedCoT-VL achieves 64.9% accuracy, exceeding the zero-shot Qwen3-VL-4B baseline of 53.9% by 11.0 percentage points and outperforming all published baselines. This result indicates that a 2B model with reasoning distillation can match or exceed models with twice the parameters. Visual grounding analysis shows that the model relies on image content rather than exploiting textual priors. Our code is publicly available at https://anonymous.4open.science/r/LiteMedCoT-VL.