EFGPP: Exploratory framework for genotype-phenotype prediction
Muhammad Muneeb, David B. Ascher
TLDR
EFGPP is a reproducible framework that integrates diverse genetic and clinical data to improve complex human trait prediction, demonstrated on migraine.
Key contributions
- Presents EFGPP, a reproducible framework for generating, ranking, and combining multiple data types for genotype-to-phenotype prediction.
- Applies EFGPP to migraine prediction using UK Biobank data from 733 individuals, integrating genotype-derived features, principal components, clinical and metabolomic covariates, and polygenic risk scores.
- Shows that combining multiple data types raised the test AUC for migraine prediction to 0.688, compared with 0.644 for the best single data type.
Why it matters
This paper addresses the challenge of predicting complex traits when the signal is spread across heterogeneous genetic, clinical, and molecular data sources. EFGPP offers a practical proof-of-concept framework to prioritize and integrate these sources. In the migraine application, combining data types raised the test AUC from 0.644 to 0.688, and polygenic risk scores derived from a depression GWAS carried useful cross-trait signal.
Original Abstract
Predicting complex human traits from genetic data is challenging because different genetic, clinical, and molecular data sources often contain different parts of the signal. Here, we present EFGPP, a reproducible framework for generating, ranking, and combining multiple types of data for genotype-to-phenotype prediction. We applied EFGPP to migraine prediction using UK Biobank data from 733 individuals. The framework combined genotype-derived features, principal components, clinical and metabolomic covariates, and polygenic risk scores generated from migraine and depression GWAS using PLINK, PRSice-2, AnnoPred, and LDAK-GWAS. The best single data type achieved a test AUC of 0.644, while combining multiple data types improved performance to 0.688 using migraine-focused inputs and 0.663 using cross-trait depression-derived inputs. Genetic features alone did not outperform the covariates-only baseline, but genotype-derived features performed better than PRS alone, and depression-derived PRS showed useful predictive signal. Overall, EFGPP provides a practical proof-of-concept framework for prioritising and integrating heterogeneous genetic data sources for complex phenotype prediction.
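As a rough illustration of two ingredients the abstract mentions, a polygenic risk score (PRS) and a test AUC, the sketch below computes a PRS as a weighted sum of allele dosages and scores it with a rank-based AUC. This is not EFGPP's code: the function names, effect sizes, and genotypes are all hypothetical toy values, and the tools named in the abstract (PLINK, PRSice-2, AnnoPred, LDAK-GWAS) apply further steps that are omitted here.

```python
from typing import Sequence


def polygenic_score(dosages: Sequence[float], betas: Sequence[float]) -> float:
    """Weighted sum of allele dosages (0, 1 or 2) by per-variant effect sizes."""
    if len(dosages) != len(betas):
        raise ValueError("dosages and betas must have the same length")
    return sum(d * b for d, b in zip(dosages, betas))


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random case outscores a random control (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Toy data (hypothetical): three variants, four individuals.
betas = [0.2, -0.1, 0.5]
genotypes = [
    [2, 0, 1],  # case
    [1, 1, 0],  # control
    [0, 2, 2],  # case
    [0, 0, 0],  # control
]
labels = [1, 0, 1, 0]

prs = [polygenic_score(g, betas) for g in genotypes]
print(auc(prs, labels))
```

The pairwise AUC is quadratic in sample size, which is acceptable for a toy example or a cohort of a few hundred individuals; a rank-based implementation from a statistics library would be the usual choice for larger data.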