Hierarchical Probabilistic Principal Component Analysis of Longitudinal Data
Xinyu Zhang, Ameer Qaqish, D. Y. Lin, Didong Li
TLDR
HPPCA is a new two-level probabilistic factor model for high-dimensional longitudinal data with missing values, outperforming existing methods in imputation and prediction.
Key contributions
- Introduces HPPCA, a two-level probabilistic factor model for high-dimensional longitudinal data.
- Separates between-subject variance from time-varying within-subject dynamics using Gaussian processes.
- Uses an EM algorithm with efficient initializers to handle missing data and flexible covariance kernels.
- Outperforms existing methods (PPCA, MFPCA) in imputation accuracy and clinical outcome prediction.
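To make the two-level structure concrete, here is a minimal simulation sketch in Python. The summary does not give the model equations, so everything in it is an assumption for illustration: the additive form (mean, plus a between-subject loading times a subject-level factor, plus a within-subject loading times time-varying factors, plus noise), the squared-exponential kernel, entries missing completely at random, and all names such as `simulate_two_level`, `W_between`, and `W_within`. It is not the authors' implementation.

```python
import numpy as np

def rbf_kernel(times, length_scale=1.0):
    # Squared-exponential covariance between all pairs of time points.
    # The kernel choice is an assumption; the paper allows flexible kernels.
    diff = times[:, None] - times[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def simulate_two_level(n_subjects=50, n_times=6, p=20, q_between=2, q_within=2,
                       noise_sd=0.1, missing_rate=0.3, seed=0):
    """Simulate data from an assumed two-level factor model with GP dynamics."""
    rng = np.random.default_rng(seed)
    times = np.linspace(0.0, 1.0, n_times)

    # Hypothetical parameters: mean vector and two loading matrices.
    mu = rng.normal(size=p)
    W_between = rng.normal(size=(p, q_between))
    W_within = rng.normal(size=(p, q_within))

    # Cholesky factor of the GP covariance (small jitter for stability).
    K = rbf_kernel(times, length_scale=0.3) + 1e-6 * np.eye(n_times)
    L = np.linalg.cholesky(K)

    Y = np.empty((n_subjects, n_times, p))
    for i in range(n_subjects):
        z = rng.normal(size=q_between)                 # time-constant subject factor
        u = L @ rng.normal(size=(n_times, q_within))   # GP draws, one column per factor
        signal = mu + W_between @ z + u @ W_within.T   # shape (n_times, p)
        Y[i] = signal + noise_sd * rng.normal(size=(n_times, p))

    # Mask entries completely at random to mimic missing values.
    mask = rng.random(Y.shape) < missing_rate
    Y_obs = np.where(mask, np.nan, Y)
    return times, Y, Y_obs, mask

times, Y, Y_obs, mask = simulate_two_level()
print(Y_obs.shape, float(mask.mean()))
```

In this sketch the subject-level factor `z` is shared across all visits of a subject, while `u` varies smoothly over time, which is one way to read "between-subject variance" versus "within-subject dynamics".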
Why it matters
Longitudinal studies with high-dimensional, incomplete data are common, but existing methods struggle with nested variation and temporal dependency. HPPCA addresses this by modeling both explicitly. In the Researching COVID to Enhance Recovery adult cohort, its learned features improved the prediction of clinical outcomes and the recovery of masked clinical records compared to existing methods.
Original Abstract
In many longitudinal studies, a large number of variables are measured repeatedly over time, with substantial missing data. Existing methods, such as probabilistic principal component analysis (PPCA), are ill-equipped to handle such incomplete, high-dimensional longitudinal data, as they fail to account for the nested sources of variation and temporal dependency inherent in repeated measures. We introduce hierarchical probabilistic principal component analysis (HPPCA), a two-level probabilistic factor model that explicitly separates between-subject variance from time-varying within-subject dynamics. The within-subject latent factors are modeled by a Gaussian process. We develop an EM algorithm to handle missing data and flexible covariance kernels, accelerated by computationally efficient initializers. Simulation studies demonstrated that HPPCA robustly recovers model parameters and subspaces and substantially outperforms both standard PPCA and multivariate functional PCA in imputation accuracy, even under heavy missingness and model misspecification. An application to long COVID symptoms in the Researching COVID to Enhance Recovery adult cohort revealed that HPPCA effectively captured the data's hierarchical structure and that its learned features significantly improved the prediction of clinical outcomes and the recovery of masked clinical records compared to existing methods.
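The abstract's imputation comparisons rest on hiding known entries and scoring how well they are recovered. A minimal sketch of that kind of masked-entry evaluation is shown below. It is an assumption about the general protocol, not the paper's exact procedure: the function names `masked_rmse` and `column_mean_impute` are hypothetical, the data are synthetic, and column-mean imputation is a naive stand-in baseline rather than HPPCA, PPCA, or MFPCA.

```python
import numpy as np

def masked_rmse(truth, imputed, held_out):
    # Root mean squared error computed only on the entries that were hidden.
    err = (truth - imputed)[held_out]
    return float(np.sqrt(np.mean(err ** 2)))

def column_mean_impute(Y_obs):
    # Naive baseline: fill each missing entry with its column's observed mean.
    col_means = np.nanmean(Y_obs, axis=0)
    return np.where(np.isnan(Y_obs), col_means, Y_obs)

# Synthetic stand-in data (rows = visits, columns = variables).
rng = np.random.default_rng(0)
truth = rng.normal(size=(200, 5))

# Hide roughly 30% of entries, impute them, and score on the hidden entries only.
held_out = rng.random(truth.shape) < 0.3
Y_obs = np.where(held_out, np.nan, truth)
imputed = column_mean_impute(Y_obs)
print(masked_rmse(truth, imputed, held_out))
```

Any imputation method can be swapped in for `column_mean_impute` as long as it returns a completed array of the same shape, which keeps the comparison across methods on identical held-out entries.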