Better Protein Function Prediction by Modeling Survivorship Bias

May 7, 2026 | arXiv:2605.06879

Zhongmou Chao, Poompol Buathong, Ekaterina Selivanovitch, Susan Daniel, Peter I. Frazier

cs.LG, q-bio.QM

TLDR

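Evo-PU improves protein function prediction by modeling survivorship bias in sequence data, outperforming standard PU learning, one-class classification, and protein language models on well-surveilled single-organism tasks.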

Key contributions

Proposes Evo-PU, a PU learning framework that models survivorship bias in protein sequence data.
Uses a scientific understanding of nucleotide mutation to distinguish unobserved sequences by how likely they are to arise.
Outperforms standard PU, one-class classification, and protein language models on single-organism tasks.
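
To make the idea of "likelihood of arising" concrete, here is a minimal Python sketch of one crude proxy: the fewest single-nucleotide substitutions needed to turn a codon into any codon for a target amino acid. This is an illustration under stated assumptions, not the paper's mutation model, which the abstract does not specify. The function name `min_nucleotide_changes` is hypothetical, and the table is the standard genetic code.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in TCAG order ('*' marks stop codons).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def min_nucleotide_changes(codon: str, target_aa: str) -> int:
    """Hypothetical helper: fewest single-nucleotide substitutions that turn
    `codon` into any codon encoding `target_aa` (0 if it already does)."""
    return min(
        sum(a != b for a, b in zip(codon, other))  # Hamming distance between codons
        for other, aa in CODON_TABLE.items()
        if aa == target_aa
    )


# Methionine (ATG) to isoleucine needs one nucleotide change,
# while methionine to aspartate needs all three positions to change.
print(min_nucleotide_changes("ATG", "I"))
print(min_nucleotide_changes("ATG", "D"))
```

Under this toy measure, an unobserved variant that is one nucleotide change away is far easier to reach than one needing three changes. That is the kind of distinction between unobserved sequences that the contribution above describes.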

Why it matters

This paper addresses a fundamental challenge in protein function prediction by explicitly modeling survivorship bias in naturally observed sequence data. By accounting for which unobserved sequences were likely to arise through mutation, Evo-PU predicts function more accurately on well-surveilled single-organism data, which may help in understanding viral evolution and developing new therapeutics.

Original Abstract

Protein sequence data from nature exhibits survivorship bias: we only observe data from those organisms that survive and reproduce, while non-functional protein mutations are eliminated by natural selection. Thus, predicting whether a protein sequence is functional often requires learning from positive examples alone. While positive-unlabeled (PU) learning frameworks offer a generic solution to this problem, existing PU methods ignore the evolutionary processes that shape sequence observability and cause survivorship bias. Consider a sequence that is one mutation away from a commonly-observed protein variant in a well-surveilled organism. If the sequence were functional, it would likely be observed. If it is not observed, this suggests non-functionality. In contrast, sequences that are unlikely to arise through mutation may be missing simply because they never arose. Thus, these two kinds of missing sequences should be treated differently when training models. In this work, we propose Evo-PU, a PU learning framework that uses a scientific understanding of nucleotide mutation to model survivorship bias for well-surveilled single-organism sequence data. On three prediction tasks using single-organism uniform-coverage surveillance data -- predicting results from held-out influenza and respiratory syncytial virus (RSV) mutagenesis studies, and predicting future SARS-CoV-2 variants -- Evo-PU outperforms standard PU learning, one-class classification (OCC), and protein language models (PLMs). On prediction tasks from multi-organism ProteinGym datasets with more heterogeneous surveillance coverage, we identify opportunities to generalize our approach.
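
The abstract's argument that the two kinds of missing sequences should be treated differently can be illustrated with a toy loss function. The Python sketch below is an assumption-laden illustration, not Evo-PU's actual objective, which the abstract does not give. An unobserved sequence contributes a "non-functional" penalty only in proportion to an assumed probability `arise_prob` that it would have arisen through mutation. The function name `weighted_pu_loss` and all inputs are hypothetical.

```python
import math


def weighted_pu_loss(scores, observed, arise_prob):
    """Toy cross-entropy loss (hypothetical, not the paper's objective).

    scores:     model logits for "sequence is functional"
    observed:   True if the sequence was seen in surveillance data
    arise_prob: assumed probability that the sequence arose by mutation
    """
    total = 0.0
    for s, obs, w in zip(scores, observed, arise_prob):
        p = 1.0 / (1.0 + math.exp(-s))  # predicted P(functional)
        if obs:
            # Observed sequences are treated as functional positives.
            total += -math.log(p)
        else:
            # Unobserved sequences count as negatives only to the extent
            # that they were likely to arise in the first place.
            total += -w * math.log(1.0 - p)
    return total / len(scores)


# Same model scores, but the unobserved sequence is either easy to reach
# (arise_prob = 1.0) or practically unreachable (arise_prob = 0.0).
reachable = weighted_pu_loss([0.0, 0.0], [True, False], [1.0, 1.0])
unreachable = weighted_pu_loss([0.0, 0.0], [True, False], [1.0, 0.0])
print(reachable > unreachable)
```

In this sketch, a missing sequence that was likely to arise is strong evidence of non-functionality, while a missing sequence that probably never arose contributes almost nothing to the loss.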
