Neural at ArchEHR-QA 2026: One Method Fits All: Unified Prompt Optimization for Clinical QA over EHRs
Abrar Majeedi, Viswanatha Reddy Gajjala, Sai Prasanna Teja Reddy Bogireddy, Siddhant Rai
TLDR
Neural1.5 uses modular prompt optimization and self-consistency to achieve strong results in clinical QA over EHRs, ranking second overall.
Key contributions
- Introduces Neural1.5, a modular approach for clinical QA over EHRs across four subtasks.
- Employs DSPy's MIPROv2 optimizer for automated, per-stage prompt optimization.
- Utilizes self-consistency voting and stage-specific verification to enhance output reliability.
- Achieved second overall rank in the ArchEHR-QA 2026 shared task, outperforming many fine-tuned models.
Why it matters
This paper demonstrates a cost-effective alternative to model fine-tuning for complex clinical QA. Its modular prompt optimization and self-consistency mechanisms offer a robust framework for improving accuracy and reliability in healthcare AI. This approach is significant for developing scalable and precise EHR-based QA systems.
Original Abstract
Automated question answering (QA) over electronic health records (EHRs) demands precise evidence retrieval, faithful answer generation, and explicit grounding of answers in clinical notes. In this work, we present Neural1.5, our method for the ArchEHR-QA 2026 shared task at CL4Health@LREC 2026, which comprises four subtasks: question interpretation, evidence identification, answer generation, and evidence alignment. Our approach decouples the task into independent, modular stages and employs DSPy's MIPROv2 optimizer to automatically discover high-performing prompts, jointly tuning instructions and few-shot demonstrations for each stage. Within every stage, self-consistency voting over multiple stochastic inference runs suppresses spurious errors and improves reliability, while stage-specific verification mechanisms (e.g., self-reflection and chain-of-verification for alignment) further refine output quality. Among all teams that participated in all four subtasks, our method ranks second overall (mean rank 4.00), placing 4th, 1st, 4th, and 7th on Subtasks 1-4, respectively. These results demonstrate that systematic, per-stage prompt optimization combined with self-consistency mechanisms is a cost-effective alternative to model fine-tuning for multifaceted clinical QA.
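The self-consistency voting the abstract describes can be sketched in a few lines: run a stochastic stage several times and keep the majority answer. This is a minimal illustration, not the paper's implementation; the `generate` callable and the toy run outputs are hypothetical stand-ins for an LLM stage.

```python
from collections import Counter

def self_consistency_vote(generate, n_runs=5):
    """Run a stochastic generator n_runs times and return the
    majority answer plus its agreement ratio, suppressing
    spurious one-off errors."""
    outputs = [generate() for _ in range(n_runs)]
    answer, count = Counter(outputs).most_common(1)[0]
    return answer, count / n_runs

# Toy stand-in for a stochastic LLM stage: four runs agree,
# one run produces a spurious answer that voting discards.
runs = iter(["sepsis", "sepsis", "pneumonia", "sepsis", "sepsis"])
answer, agreement = self_consistency_vote(lambda: next(runs), n_runs=5)
print(answer, agreement)  # → sepsis 0.8
```

In practice the agreement ratio can double as a confidence signal, e.g. to trigger the stage-specific verification (self-reflection, chain-of-verification) the paper applies when outputs disagree.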