ArXiv TLDR

Data anonymization in the presence of outliers via invariant coordinate selection

arXiv:2605.04833

Katariina Perkonoja, Joni Virta

stat.ME · cs.CR

TLDR

This paper introduces ICSA, a robust data anonymization method that replaces PCA with invariant coordinate selection to protect data sets containing outliers, outperforming standard spectral anonymization on the privacy-utility trade-off.

Key contributions

  • Proposes ICSA, a robust anonymization method replacing PCA with Invariant Coordinate Selection (ICS).
  • Theoretically proves that standard spectral anonymization (SA) fails under sufficiently influential outliers.
  • Shows ICSA achieves stronger privacy protection than SA, with comparable, and in some cases improved, utility across various contamination settings.
  • Demonstrates ICSA's superior privacy-utility efficiency on a benchmark clinical dataset.

Why it matters

Sensitive data sets often contain outlying observations, and anonymizing them without destroying utility is hard. This paper offers a robust solution, ICSA, that explicitly accounts for outliers and improves anonymization performance. This matters for secure data sharing, especially in fields like healthcare, where data frequently contain anomalies.

Original Abstract

Protecting confidential data while preserving utility is particularly challenging when data sets contain outlying observations. Existing latent space anonymization methods, such as spectral anonymization (SA), rely on principal component analysis (PCA) and may therefore be vulnerable to contamination. We investigate anonymization in the presence of outliers and propose ICSA, a robust alternative to SA based on invariant coordinate selection (ICS). By replacing the PCA transformation with ICS, the robustness of the anonymization procedure can be regulated through the choice of scatter matrices. Alongside the methodological development, we derive a theoretical result showing that SA fails under sufficiently influential outliers. To assess the practical implications of this result, we compare the privacy-utility trade-off of ICSA and SA through simulation experiments under varying contamination settings and outlier severities. Our findings indicate that implementations of ICSA based on robust scatter matrices achieve stronger privacy protection than SA, while typically maintaining comparable, and in some cases improved, utility. We further examine the empirical performance of the proposed method using a benchmark clinical data set, where ICSA demonstrates superior overall privacy-utility efficiency relative to SA. These results suggest that explicitly accounting for outliers can materially improve anonymization performance and that robust latent space transformations offer a promising direction for privacy-preserving statistical data release.
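The abstract describes the core idea: transform the data to invariant coordinates using a pair of scatter matrices instead of PCA, anonymize in that latent space, and back-transform. The paper's exact scatter choices and anonymization step are not given in this summary, so the sketch below is a hypothetical illustration: it uses the classical covariance / fourth-moment (FOBI) scatter pair for the ICS transform and independent per-coordinate permutation as the latent-space randomization, a mechanism used in spectral anonymization. A robust scatter estimator such as MCD would slot into the `S1`/`S2` roles to obtain the robustness the paper studies.

```python
import numpy as np

def ics_anonymize(X, rng=None):
    """Hypothetical sketch of ICS-based latent-space anonymization.

    Replaces the PCA step of spectral anonymization with an
    invariant-coordinate transform built from two scatter matrices
    (here the covariance / fourth-moment FOBI pair, for self-containment;
    the paper's robust variants would use e.g. an MCD scatter instead).
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu

    # First scatter: ordinary covariance. Whiten the data with it.
    S1 = np.cov(Xc, rowvar=False)
    L = np.linalg.cholesky(S1)
    Y = Xc @ np.linalg.inv(L).T              # whitened coordinates

    # Second scatter on the whitened data: fourth-moment scatter (Cov4).
    r2 = np.sum(Y**2, axis=1)                # squared Mahalanobis radii
    S2 = (Y * r2[:, None]).T @ Y / (n * (p + 2))

    # Invariant coordinates: eigenbasis of S2 in the whitened space
    # (equivalent to the eigendecomposition of S1^{-1} S2).
    _, V = np.linalg.eigh(S2)
    Z = Y @ V                                # latent (invariant) coordinates

    # Anonymize: independently permute each latent coordinate, which
    # preserves each latent marginal while breaking record linkage.
    Z_anon = np.column_stack(
        [rng.permutation(Z[:, j]) for j in range(p)]
    )

    # Back-transform to the original scale.
    return Z_anon @ V.T @ L.T + mu
```

Because the randomization only permutes within latent coordinates, the released data exactly preserve the sample mean and approximately preserve the covariance structure; how much higher-order structure survives depends on the scatter pair chosen.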
