Distributionally Robust K-Means Clustering
Vikrant Malik, Taylan Kargin, Babak Hassibi
TLDR
This paper introduces a distributionally robust K-Means clustering method that uses a Wasserstein-2 ball to protect against outliers and distribution shifts.
Key contributions
- Introduces a distributionally robust K-Means variant that guards against outliers, distribution shifts, and limited sample sizes.
- Models the unknown population distribution as lying within a Wasserstein-2 ball around the empirical distribution.
- Derives a minimax formulation whose tractable dual yields a soft-clustering scheme with smoothly weighted assignments.
- Proposes an efficient block coordinate descent algorithm with provable monotonic decrease and local linear convergence.
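The minimax formulation described in the contributions above can be written out as a sketch. The notation here is an assumption chosen for illustration and may differ from the paper's: $c_1, \dots, c_K$ are the cluster centers, $\hat{P}_N$ is the empirical distribution of the $N$ samples $x_1, \dots, x_N$, $W_2$ is the Wasserstein-2 distance, and $\rho \ge 0$ is the radius of the ambiguity ball.

```latex
\min_{c_1, \dots, c_K} \;
\sup_{Q \,:\, W_2(Q, \hat{P}_N) \le \rho} \;
\mathbb{E}_{x \sim Q}\!\left[ \min_{1 \le k \le K} \lVert x - c_k \rVert_2^2 \right],
\qquad
\hat{P}_N = \frac{1}{N} \sum_{i=1}^{N} \delta_{x_i}
```

With $\rho = 0$ the ambiguity set contains only $\hat{P}_N$, and the objective reduces to the standard K-Means cost $\frac{1}{N} \sum_{i=1}^{N} \min_k \lVert x_i - c_k \rVert_2^2$.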
Why it matters
K-Means is widely used but brittle to outliers, distribution shifts, and limited sample sizes. This work addresses that brittleness with a principled minimax formulation, a tractable dual, and a block coordinate descent algorithm that comes with convergence guarantees. In the authors' experiments on standard benchmarks and large-scale synthetic data, the method substantially improves outlier detection and robustness to noise.
Original Abstract
K-means clustering is a workhorse of unsupervised learning, but it is notoriously brittle to outliers, distribution shifts, and limited sample sizes. Viewing k-means as Lloyd-Max quantization of the empirical distribution, we develop a distributionally robust variant that protects against such pathologies. We posit that the unknown population distribution lies within a Wasserstein-2 ball around the empirical distribution. In this setting, one seeks cluster centers that minimize the worst-case expected squared distance over this ambiguity set, leading to a minimax formulation. A tractable dual yields a soft-clustering scheme that replaces hard assignments with smoothly weighted ones. We propose an efficient block coordinate descent algorithm with provable monotonic decrease and local linear convergence. Experiments on standard benchmarks and large-scale synthetic data demonstrate substantial gains in outlier detection and robustness to noise.
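To make the idea of replacing hard assignments with smoothly weighted ones concrete, here is a minimal sketch of a generic soft K-Means loop that alternates between a weight update and a center update. It is not the paper's algorithm: the softmax weighting and the `temperature` parameter are assumptions made for illustration, whereas the paper derives its weights from the dual of the Wasserstein-2 problem. All function names are hypothetical.

```python
import numpy as np


def soft_assignments(X, centers, temperature=1.0):
    """Softmax weights over centers (illustrative choice, not the paper's dual weights)."""
    # Squared Euclidean distance from every point to every center, shape (N, K).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / temperature
    # Subtract the row maximum for numerical stability before exponentiating.
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)


def update_centers(X, weights):
    """Each center becomes the weighted mean of all points, shape (K, d)."""
    mass = weights.sum(axis=0)
    # Guard against a center that receives numerically zero total weight.
    mass = np.maximum(mass, 1e-12)
    return (weights.T @ X) / mass[:, None]


def soft_kmeans(X, K, temperature=1.0, n_iter=50, seed=0):
    """Alternate the two block updates for a fixed number of iterations."""
    rng = np.random.default_rng(seed)
    # Initialize centers at K distinct data points.
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iter):
        weights = soft_assignments(X, centers, temperature)
        centers = update_centers(X, weights)
    return centers, soft_assignments(X, centers, temperature)
```

As `temperature` shrinks toward zero, the weights in this sketch concentrate on the nearest center and the loop behaves like ordinary hard-assignment K-Means.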