ArXiv TLDR

scpFormer: A Foundation Model for Unified Representation and Integration of the Single-Cell Proteomics

arXiv:2604.20003

Qifeng Zhou, Lei Yu, Yuzhi Guo, Yuwei Miao, Hehuan Ma + 3 more

q-bio.QM, cs.AI, cs.LG

TLDR

scpFormer is a transformer-based foundation model for single-cell proteomics that unifies data from fragmented antibody panels.

Key contributions

  • Introduces scpFormer, a transformer-based foundation model pre-trained on over 390 million single cells.
  • Unifies fragmented antibody panel data using continuous, sequence-anchored tokenization and ESM embeddings.
  • Achieves competitive performance in large-scale batch integration and unsupervised clustering.
  • Enables in silico panel expansion and transfers protein co-expression logic to bulk-omics tasks.

Why it matters

scpFormer targets a central obstacle in single-cell proteomics: fragmented antibody panels make datasets hard to integrate. By mapping variable panels into a shared representation space and expanding panels in silico, it could make biomarker discovery more scalable. The authors also report that the learned protein co-expression patterns transfer to bulk-omics tasks such as cancer drug response prediction, which points to possible uses in precision oncology.

Original Abstract

The integration of single-cell proteomic data is often hindered by the fragmented nature of targeted antibody panels. To address this limitation, we introduce scpFormer, a transformer-based foundation model designed for single-cell proteomics. Pre-trained on over 390 million cells, scpFormer replaces standard index-based tokenization with a continuous, sequence-anchored approach. By combining Evolutionary Scale Modeling (ESM) with value-aware expression embeddings, it dynamically maps variable panels into a shared semantic space without artificial discretization. We demonstrate that scpFormer generates global cell representations that perform competitively in large-scale batch integration and unsupervised clustering. Moreover, its open-vocabulary architecture facilitates in silico panel expansion, assisting in the reconstruction of biological manifolds in sparse clinical datasets. Finally, this learned protein co-expression logic is transferable to bulk-omics tasks, supporting applications like cancer drug response prediction. scpFormer provides a versatile, panel-agnostic framework to facilitate scalable biomarker discovery and precision oncology.
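The idea of a continuous, sequence-anchored token can be illustrated with a small sketch. This is not the authors' implementation: the summation of the two embeddings, the widths `D_MODEL` and `HIDDEN`, the helper names, and the toy sequences are all assumptions made for illustration, and a hash-seeded random vector stands in for a real ESM embedding so the example runs without any model download. It shows only that a protein's token depends on its sequence and its measured value, not on its position or index in a particular panel.

```python
import hashlib

import numpy as np

D_MODEL = 16  # hypothetical embedding width
HIDDEN = 8    # hypothetical hidden width of the value MLP


def stand_in_sequence_embedding(sequence, dim=D_MODEL):
    """Deterministic placeholder for an ESM embedding of a protein sequence.

    A real pipeline would call a protein language model here; this stand-in
    only guarantees that the same sequence always yields the same vector.
    """
    digest = hashlib.sha256(sequence.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "little")
    return np.random.default_rng(seed).standard_normal(dim)


def make_value_mlp(seed=0, hidden=HIDDEN, dim=D_MODEL):
    """Randomly initialised weights for a tiny value-embedding MLP (untrained)."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.standard_normal((1, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.standard_normal((hidden, dim)),
        "b2": np.zeros(dim),
    }


def value_embedding(values, mlp):
    """Map each continuous expression value to a vector, with no binning."""
    x = np.asarray(values, dtype=float).reshape(-1, 1)  # (n_proteins, 1)
    h = np.tanh(x @ mlp["w1"] + mlp["b1"])              # (n_proteins, hidden)
    return h @ mlp["w2"] + mlp["b2"]                    # (n_proteins, dim)


def tokenize_cell(sequences, values, mlp):
    """Build one token per measured protein: sequence embedding + value embedding."""
    seq_emb = np.stack([stand_in_sequence_embedding(s) for s in sequences])
    return seq_emb + value_embedding(values, mlp)


mlp = make_value_mlp()

# Toy strings, not real protein sequences. The two panels differ in size
# and order but share two proteins.
panel_a = ["MKTAYIAK", "GSHMLEDP", "MVLSPADK"]
panel_b = ["MVLSPADK", "MKTAYIAK"]

tokens_a = tokenize_cell(panel_a, [0.3, 2.1, 1.7], mlp)
tokens_b = tokenize_cell(panel_b, [1.7, 0.3], mlp)

print(tokens_a.shape, tokens_b.shape)
# Shared proteins with equal values get identical tokens in both panels.
print(np.allclose(tokens_a[0], tokens_b[1]), np.allclose(tokens_a[2], tokens_b[0]))
```

Because no token depends on a fixed vocabulary index, a protein absent from one panel needs no placeholder slot, and a protein never seen in any panel can still be given a token from its sequence. That property is what the abstract's "open-vocabulary" description refers to; how scpFormer actually fuses the two embeddings is not specified in this summary.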
