ArXiv TLDR

FL-Sailer: Efficient and Privacy-Preserving Federated Learning for Scalable Single-Cell Epigenetic Data Analysis via Adaptive Sampling

arXiv:2605.04519

Guangyi Zhang, Yi Dai, Yiyun He, Junhao Liu

cs.LG, stat.ML

TLDR

FL-Sailer is the first federated learning framework for scATAC-seq data, enabling privacy-preserving multi-institutional analysis.

Key contributions

  • Introduces FL-Sailer, the first federated learning framework for scATAC-seq data analysis.
  • Uses adaptive leverage score sampling to select biologically interpretable features while reducing dimensionality by 80%.
  • Employs an invariant VAE to disentangle biological signals from technical confounders via mutual information minimization.
  • Provides a convergence guarantee: FL-Sailer converges to an approximate solution of the original high-dimensional problem with bounded error.
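
To make the leverage score sampling named in the contributions concrete, here is a minimal sketch of plain rank-k column leverage score sampling on a cells-by-peaks matrix. It is not the paper's adaptive scheme, whose details are not given in this summary. The function names, the rank of 10, the toy matrix, and the 20% keep fraction (chosen to mirror the reported 80% reduction) are all assumptions made for illustration.

```python
import numpy as np

def column_leverage_scores(X, rank):
    """Rank-k leverage score of each column (feature) of X.

    Hypothetical helper: the score of feature j is the squared norm of
    column j of the top-`rank` right singular vectors, so scores sum to `rank`.
    """
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    Vk = Vt[:rank]                      # shape (rank, n_features)
    return np.sum(Vk ** 2, axis=0)

def sample_features(X, rank, keep_fraction, rng):
    """Sample feature indices without replacement, proportional to leverage."""
    scores = column_leverage_scores(X, rank)
    probs = scores / scores.sum()
    n_keep = max(1, int(round(keep_fraction * X.shape[1])))
    idx = rng.choice(X.shape[1], size=n_keep, replace=False, p=probs)
    return np.sort(idx)

# Toy stand-in for a sparse binary accessibility matrix (assumed sizes).
rng = np.random.default_rng(0)
X = (rng.random((200, 500)) < 0.05).astype(float)

# Keeping 20% of the features corresponds to an 80% dimensionality reduction.
kept = sample_features(X, rank=10, keep_fraction=0.2, rng=rng)
X_reduced = X[:, kept]
```

Because the sampled indices point at original columns (peaks), the reduced matrix stays interpretable in terms of the input features, unlike a projection such as PCA.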

Why it matters

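Privacy regulations and data size constraints have so far hindered multi-institutional sharing of scATAC-seq data. FL-Sailer makes such collaboration feasible and, in the authors' experiments on synthetic and real epigenomic datasets, also surpasses centralized methods, which the authors attribute to adaptive sampling acting as an implicit regularizer that suppresses technical noise. The authors argue that federated learning, when tailored to domain-specific challenges, can become a superior paradigm for collaborative epigenomic research.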

Original Abstract

Single-cell ATAC-seq (scATAC-seq) enables high-resolution mapping of chromatin accessibility, yet privacy regulations and data size constraints hinder multi-institutional sharing. Federated learning (FL) offers a privacy-preserving alternative, but faces three fundamental barriers in scATAC-seq analysis: ultra-high dimensionality, extreme sparsity, and severe cross-institutional heterogeneity. We propose FL-Sailer, the first FL framework designed for scATAC-seq data. FL-Sailer integrates two key innovations: (i) adaptive leverage score sampling, which selects biologically interpretable features while reducing dimensionality by 80%, and (ii) an invariant VAE architecture, which disentangles biological signals from technical confounders via mutual information minimization. We provide a convergence guarantee, showing that FL-Sailer converges to an approximate solution of the original high-dimensional problem with bounded error. Extensive experiments on synthetic and real epigenomic datasets demonstrate that FL-Sailer not only enables previously infeasible multi-institutional collaborations but also surpasses centralized methods by leveraging adaptive sampling as an implicit regularizer to suppress technical noise. Our work establishes that federated learning, when tailored to domain-specific challenges, can become a superior paradigm for collaborative epigenomic research.
