ArXiv TLDR

Secure Cross-Silo Synthetic Genomic Data Generation

2604.27456

Daniil Filienko, Martine De Cock, Sikha Pentyala

cs.CR

TLDR

This paper introduces a secure method for generating synthetic genomic data across multiple institutions using MPC and DP, enabling privacy-preserving AI development.

Key contributions

  • Enables joint training of synthetic data generators across multiple sites without revealing raw data.
  • Uses secure multi-party computation (MPC) to ensure input privacy, so no party ever discloses its data in unencrypted form.
  • Integrates differential privacy (DP) to provide output privacy, mitigating information leakage from the released synthetic data.
  • Empirically validated on real RNA-seq cohorts, demonstrating utility in federated settings.
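The division of labor between the two techniques can be illustrated with a toy sketch: MPC-style additive secret sharing lets sites combine statistics without revealing inputs, and Laplace noise provides DP on the released output. This is a minimal illustration under assumed parameters (two parties, a 32-bit modulus, sensitivity 1, epsilon 1), not the paper's actual protocol or generative model:

```python
import numpy as np

rng = np.random.default_rng(0)
MODULUS = 2**32

def additive_shares(x, n_parties, modulus=MODULUS):
    """Split an integer vector x into n additive shares mod `modulus`.
    Any single share alone is uniformly random and reveals nothing about x."""
    shares = [rng.integers(0, modulus, size=x.shape) for _ in range(n_parties - 1)]
    last = (x - sum(shares)) % modulus
    return shares + [last]

def reconstruct(shares, modulus=MODULUS):
    """Recover the secret by summing all shares mod `modulus`."""
    return sum(shares) % modulus

# Two hospitals each hold local per-gene count sums (toy data).
site_a = np.array([120, 45, 300], dtype=np.int64)
site_b = np.array([80, 55, 100], dtype=np.int64)

# Each site secret-shares its vector; the parties add corresponding shares
# locally, so the joint sum is computed without either raw input being exposed.
shares_a = additive_shares(site_a, n_parties=2)
shares_b = additive_shares(site_b, n_parties=2)
joint_shares = [(sa + sb) % MODULUS for sa, sb in zip(shares_a, shares_b)]
joint_sum = reconstruct(joint_shares)  # equals site_a + site_b

# Output privacy: add Laplace noise calibrated to sensitivity / epsilon
# before the aggregate statistic is released outside the secure computation.
sensitivity, epsilon = 1.0, 1.0
dp_sum = joint_sum + rng.laplace(scale=sensitivity / epsilon, size=joint_sum.shape)
```

In the paper's setting, the securely computed quantities would be the training updates of a synthetic data generator rather than simple sums, but the same principle applies: inputs stay shared, and only DP-protected outputs are released.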

Why it matters

This paper addresses a critical challenge in genomic AI by enabling secure data sharing across institutions. It allows researchers to leverage distributed sensitive data for AI development without compromising individual privacy, accelerating progress in areas like rare disease research.

Original Abstract

Access to genomic data is highly regulated due to its sensitive nature. While safeguards are essential, cumbersome data access processes pose a significant barrier to the development of AI methods for genomics. Synthetic data generation can mitigate this tension by enabling broader data sharing without exposing sensitive information. Synthetic genomic data are produced by training generative models on real data and subsequently sampling artificial data that preserves relevant statistics while limiting disclosures about the underlying individuals. In some settings, a single data holder may have sufficient data to train such generative models; however, in many applications data must be combined across multiple sites to achieve adequate scale. This need arises, e.g., in rare disease studies, where individual hospitals typically hold data for only a small number of patients. The solution we present in this paper enables multiple data holders to jointly train a synthetic data generator without revealing their raw data. Our approach combines secure multiparty computation (MPC) to ensure input privacy, so that no party ever discloses its data in unencrypted form, with differential privacy (DP) to provide output privacy by mitigating information leakage from the released synthetic data. We empirically demonstrate the effectiveness of the proposed method by generating high-utility synthetic datasets from multiple real RNA-seq cohorts in federated settings, showing that our approach enables privacy-preserving data synthesis even when data are distributed across institutions.
