Evaluating the Limitations of Protein Sequence Representations for Parkinson's Disease Classification

April 13, 2026
arXiv:2604.11852

César Jesús Núñez-Prado, Grigori Sidorov, Liliana Chanona-Hernández

q-bio.QM, cs.AI, cs.LG

TLDR

Representations built from protein primary sequences alone show limited discriminative power for Parkinson's disease classification, suggesting that richer biological features (structural, functional, or interaction-based) are needed.

Key contributions

Evaluated diverse protein sequence representations for Parkinson's disease classification.
Used a rigorous, leakage-free nested stratified cross-validation framework.
Found only moderate discriminative performance across all representations (F1 between 0.60 and 0.70; best 0.704 with ProtBERT + MLP).
Concluded that primary sequence data alone is insufficient and that structural, functional, or interaction-based features are needed.
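
The nested stratified cross-validation mentioned above can be illustrated with a minimal sketch. It is not the authors' code: the function names (`stratified_folds`, `nested_splits`), the fold counts, and the toy labels are all assumptions made for illustration. The point it shows is that the inner folds used for model selection are drawn only from the outer training indices, so the outer test fold never influences tuning.

```python
import random
from collections import defaultdict

def stratified_folds(indices, labels, k, seed=0):
    """Split `indices` into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    for cls in sorted(by_class):
        members = by_class[cls]
        rng.shuffle(members)
        # Deal the members of each class round-robin across the folds.
        for pos, i in enumerate(members):
            folds[pos % k].append(i)
    return folds

def nested_splits(labels, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) as lists of indices.

    inner_splits holds (inner_train, inner_val) pairs built only from
    outer_train, which is what keeps the outer estimate leakage-free.
    """
    all_idx = list(range(len(labels)))
    outer = stratified_folds(all_idx, labels, outer_k, seed)
    for f, test in enumerate(outer):
        train = [i for g, fold in enumerate(outer) if g != f for i in fold]
        inner = stratified_folds(train, labels, inner_k, seed + 1 + f)
        inner_splits = []
        for v, val in enumerate(inner):
            inner_train = [i for w, fold in enumerate(inner) if w != v for i in fold]
            inner_splits.append((inner_train, val))
        yield train, test, inner_splits

# Toy labels (hypothetical): 20 positive and 20 negative samples.
labels = [1] * 20 + [0] * 20
splits = list(nested_splits(labels))
```

In a real pipeline, hyperparameters would be chosen on the inner splits and the selected model scored once on each outer test fold; any feature scaling or selection would also be fitted on training indices only.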

Why it matters

This paper establishes a reproducible baseline for Parkinson's disease classification from protein sequences. It shows empirically that primary sequence data alone has limited discriminative power, which points future work toward richer biological features such as structural, functional, or interaction-based descriptors.

Original Abstract

The identification of reliable molecular biomarkers for Parkinson's disease remains challenging due to its multifactorial nature. Although protein sequences constitute a fundamental and widely available source of biological information, their standalone discriminative capacity for complex disease classification remains unclear. In this work, we present a controlled and leakage-free evaluation of multiple representations derived exclusively from protein primary sequences, including amino acid composition, k-mers, physicochemical descriptors, hybrid representations, and embeddings from protein language models, all assessed under a nested stratified cross-validation framework to ensure unbiased performance estimation. The best-performing configuration (ProtBERT + MLP) achieves an F1-score of 0.704 +/- 0.028 and ROC-AUC of 0.748 +/- 0.047, indicating only moderate discriminative performance. Classical representations such as k-mers reach comparable F1 values (up to approximately 0.667), but exhibit highly imbalanced behavior, with recall close to 0.98 and precision around 0.50, reflecting a strong bias toward positive predictions. Across representations, performance differences remain within a narrow range (F1 between 0.60 and 0.70), while unsupervised analyses reveal no intrinsic structure aligned with class labels, and statistical testing (Friedman test, p = 0.1749) does not indicate significant differences across models. These results demonstrate substantial overlap between classes and indicate that primary sequence information alone provides limited discriminative power for Parkinson's disease classification. This work establishes a reproducible baseline and provides empirical evidence that more informative biological features, such as structural, functional, or interaction-based descriptors, are required for robust disease modeling.
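
As an illustrative aside to the abstract, the sketch below shows what two of the classical representations it names (amino acid composition and k-mers) look like in practice, and checks the reported k-mer metrics arithmetically. The toy sequence, the function names, and the choice of k are assumptions, not details from the paper. Since F1 is the harmonic mean of precision and recall, a precision of about 0.50 with a recall of about 0.98 gives an F1 of about 0.66, in line with the reported value of approximately 0.667.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aa_composition(seq):
    """Fraction of each standard amino acid in the sequence."""
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}

def kmer_counts(seq, k=3):
    """Counts of overlapping k-mers (substrings of length k)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

toy = "MKVLAAGKVM"  # arbitrary toy sequence, for illustration only
comp = aa_composition(toy)      # 20-dimensional composition vector
kmers = kmer_counts(toy, k=2)   # sparse 2-mer counts

# Approximate k-mer metrics quoted in the abstract.
f1_kmer = f1_score(precision=0.50, recall=0.98)  # about 0.662
```

A precision near 0.50 on a roughly balanced dataset is what a classifier that labels almost everything positive would produce, which is why the abstract reads the high recall as a bias toward positive predictions rather than as genuine discrimination.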
