ArXiv TLDR

Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering

arXiv: 2605.12313

Rezarta Islamaj, Joey Chan, Robert Leaman, Jongmyung Jung, Hyeongsoon Hwang + 11 more

cs.CL, cs.IR

TLDR

The MedHopQA track benchmarked LLMs on multi-hop medical QA with a new 1,000-pair dataset, highlighting RAG's importance for strong performance.

Key contributions

  • Introduced MedHopQA, a BioCreative IX shared task benchmarking LLMs on multi-hop medical QA.
  • Developed a novel 1,000-pair dataset spanning diseases, genes, and chemicals, with emphasis on rare diseases; each question requires two-hop reasoning across two distinct Wikipedia pages.
  • Found retrieval-augmented generation (RAG) and related retrieval-based strategies critical for strong performance.
  • Top system achieved 89.30% F1 on the concept-level MedCPT metric and 87.30% exact match, vs. 67.40% and 60.20% for the zero-shot baseline (concept-level matching is sketched after this list).
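
Neither the TLDR nor the abstract spells out how the MedCPT score is computed; the sketch below only illustrates the general idea of concept-level matching, i.e. crediting answers that name the same concept in a different surface form. The ncbi/MedCPT-Query-Encoder checkpoint and the 0.9 cosine threshold are assumptions for illustration, not the track's official settings.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "ncbi/MedCPT-Query-Encoder"  # public biomedical encoder (assumed here)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def embed(texts):
    """Encode short answer strings into L2-normalized [CLS] embeddings."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=64, return_tensors="pt")
    with torch.no_grad():
        cls = encoder(**batch).last_hidden_state[:, 0, :]
    return torch.nn.functional.normalize(cls, dim=-1)

def concept_match(prediction, gold, threshold=0.9):
    """Count two answers as the same concept when cosine similarity is high."""
    pred_vec, gold_vec = embed([prediction, gold])
    return float(pred_vec @ gold_vec) >= threshold

# A concept-level metric should accept e.g. "acetylsalicylic acid" for the
# gold answer "aspirin", which strict exact match would score as wrong.
```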

Why it matters

Multi-hop question answering is a significant challenge in the biomedical domain, crucial for integrating information across diverse sources. This paper provides a new benchmark and publicly available dataset, MedHopQA, to drive progress in this area. It also demonstrates the critical role of RAG, guiding future research in developing more capable biomedical QA systems.
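
The digest does not say how the winning systems wired retrieval in, so the following is only a hedged sketch of a generic two-hop retrieve-then-answer loop. `search_wikipedia` and `llm` are hypothetical stand-ins for a real retriever and LLM client, and the bridge-entity prompt is an assumption, not the top team's method.

```python
def answer_two_hop(question, search_wikipedia, llm, k=3):
    """Two-hop RAG sketch: retrieve, find a bridge entity, retrieve again, answer."""
    # Hop 1: retrieve pages relevant to the question itself.
    hop1 = search_wikipedia(question, top_k=k)

    # Ask the model which entity must be looked up next, given hop-1 evidence.
    bridge = llm(
        f"Question: {question}\nEvidence: {hop1}\n"
        "Name the single entity that must be looked up to answer this:"
    )

    # Hop 2: retrieve pages about the bridge entity, then answer concisely.
    hop2 = search_wikipedia(bridge, top_k=k)
    return llm(
        f"Question: {question}\nEvidence: {hop1 + hop2}\nAnswer concisely:"
    )
```

Passing the retriever and model in as callables keeps the sketch independent of any particular search API or LLM SDK.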

Original Abstract

Multi-hop question answering (QA) remains a significant challenge in the biomedical domain, requiring systems to integrate information across multiple sources to answer complex questions. To address this problem, the BioCreative IX MedHopQA shared task was designed to benchmark multi-hop reasoning in large language models (LLMs). We developed a novel dataset of 1,000 challenging QA pairs spanning diseases, genes, and chemicals, with particular emphasis on rare diseases. Each question was constructed to require two-hop reasoning through the integration of information from two distinct Wikipedia pages. The challenge attracted 48 submissions from 13 teams. Systems were evaluated using both surface string comparison and conceptual accuracy (MedCPT score). The results showed a substantial performance gap between baseline LLMs and enhanced systems. The top-ranked submission achieved an 89.30% F1 score on the MedCPT metric and an 87.30% exact match (EM) score, compared with 67.40% and 60.20%, respectively, for the zero-shot baseline. A central finding of the challenge was that retrieval-augmented generation (RAG) and related retrieval-based strategies were critical for strong performance. In addition, concept-level evaluation improved answer assessment when correct responses differed in surface form. The MedHopQA dataset is publicly available to support continued progress in this important area. Challenge materials: https://www.ncbi.nlm.nih.gov/research/bionlp/medhopqa; benchmark: https://www.codabench.org/competitions/7609/
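
The "surface string comparison" mentioned above is presumably the familiar exact-match / token-level F1 pair; a minimal sketch follows. The SQuAD-style normalization (lowercasing, stripping punctuation and articles) is an assumption, since the track's exact rules are not reproduced in this digest.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    """1.0 iff the normalized strings are identical."""
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall over normalized tokens."""
    pred, ref = normalize(prediction).split(), normalize(gold).split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```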
