Large Language Models for Variant-Centric Functional Evidence Mining
TLDR
This paper introduces AcmGENTIC, an LLM-powered pipeline and benchmark for automating the extraction and classification of functional evidence for genomic variants.
Key contributions
- Evaluated gpt-4o-mini and o4-mini LLMs on abstract screening and full-text evidence extraction for genomic variants.
- In full-text evidence classification, o4-mini achieved 96% accuracy and substantially higher specificity than gpt-4o-mini (0.83 vs. 0.37).
- Developed AcmGENTIC, an end-to-end pipeline for automated variant-centric functional evidence mining.
- AcmGENTIC integrates literature retrieval, LLM-based filtering, multimodal extraction, and report generation.
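The pipeline stages above can be sketched as a simple sequence of functions. This is a minimal illustrative skeleton, not the authors' implementation: every function body, identifier, and return shape here is a hypothetical placeholder (a real system would call LitVar2, an LLM screening prompt, and a PDF extractor at the marked points).

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceReport:
    """Container for the curator-facing output of one variant run."""
    variant_id: str
    papers_screened: int = 0
    papers_with_evidence: list = field(default_factory=list)

def expand_identifiers(variant_id):
    # Placeholder: a real pipeline would expand the variant into
    # aliases (HGVS notations, rsIDs, protein-level names).
    return [variant_id]

def retrieve_literature(identifiers):
    # Placeholder for a LitVar2 literature query; returns
    # (pmid, abstract) pairs. Dummy data for illustration.
    return [
        ("PMID:000001", "Functional assay directly testing the variant."),
        ("PMID:000002", "Population frequency survey of the region."),
    ]

def screen_abstract(abstract):
    # Placeholder for the LLM abstract filter: keep only studies
    # reporting functional experiments on the specific variant.
    return "functional" in abstract.lower()

def extract_evidence(pmid):
    # Placeholder for multimodal full-text extraction from the PDF,
    # including evidence direction and a narrative summary.
    return {"pmid": pmid,
            "direction": "pathogenic-supporting",
            "summary": "Assay shows loss of function."}

def run_pipeline(variant_id):
    ids = expand_identifiers(variant_id)
    papers = retrieve_literature(ids)
    kept = [pmid for pmid, abstract in papers if screen_abstract(abstract)]
    report = EvidenceReport(variant_id, papers_screened=len(papers))
    report.papers_with_evidence = [extract_evidence(p) for p in kept]
    return report
```

The human-in-the-loop design means the final `EvidenceReport` is a draft for curator review, not an autonomous classification.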
Why it matters
This research addresses the labor-intensive process of curating functional evidence for genomic variants, which is crucial for clinical interpretation. By providing both a benchmark and an automated pipeline, it offers a practical framework for scaling and accelerating this critical task, improving efficiency in genomic medicine.
Original Abstract
Functional evidence is essential for clinical interpretation of genomic variants, but identifying relevant studies and translating experimental results into structured evidence remains labor-intensive. We developed a benchmark based on ClinGen-curated annotations to evaluate two large language models (LLMs), a non-reasoning model (gpt-4o-mini) and a reasoning model (o4-mini), on tasks relevant to functional evidence curation: (1) abstract screening to determine whether a study reports functional experiments directly testing specific variants, and (2) full-text evidence extraction and classification from matched variant-paper pairs, including interpretation of evidence direction and generation of evidence summaries. Starting from ClinGen variants annotated with functional evidence, we processed curator comments with an LLM to extract PubMed identifiers, evidence labels, and narrative, and retrieved titles, abstracts, and open-access PDFs to construct variant-paper pairs. In abstract screening, both models achieved high recall (0.88-0.90) with moderate specificity (0.59-0.65). For full-text evidence classification under an explicit variant-matching gate, o4-mini achieved 96% accuracy and higher specificity (0.83 vs. 0.37) while maintaining high F1 (0.98 vs. 0.96) compared with gpt-4o-mini. We also used an LLM-as-judge protocol to compare model-generated evidence summaries with expert curator comments. Finally, we developed AcmGENTIC, an end-to-end pipeline that expands variant identifiers, retrieves literature via LitVar2, filters abstracts with LLMs, acquires PDFs, performs multimodal evidence extraction, and generates evidence reports for curator review, with optional agentic parsing of figures and tables. Together, this benchmark and pipeline provide a practical framework for scaling functional evidence curation with human-in-the-loop LLM assistance.
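The abstract reports recall, specificity, and F1 for the screening and classification tasks. These are the standard definitions computed from a binary confusion matrix; the sketch below shows the arithmetic using made-up counts (not the paper's data), where positives are papers that truly contain variant-specific functional evidence.

```python
def screening_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion counts.

    tp/fn: relevant papers kept/missed by the screen
    fp/tn: irrelevant papers kept/correctly excluded
    """
    recall = tp / (tp + fn)            # sensitivity: relevant papers found
    specificity = tn / (tn + fp)       # irrelevant papers correctly excluded
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": round(recall, 2),
            "specificity": round(specificity, 2),
            "f1": round(f1, 2)}
```

With high recall but moderate specificity (the regime the abstract reports for abstract screening), the filter passes most relevant papers downstream at the cost of extra false positives, which the stricter full-text classification stage then removes.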