Scalable Model-Based Clustering with Sequential Monte Carlo
Connie Trojan, Pavel Myshkov, Paul Fearnhead, James Hensman, Tom Minka + 1 more
TLDR
This paper introduces a Sequential Monte Carlo algorithm that scales model-based clustering by decomposing clustering problems into approximately independent subproblems.
Key contributions
- Addresses the prohibitive memory requirements of traditional Sequential Monte Carlo in large-scale online clustering.
- Introduces a novel SMC algorithm that decomposes clustering into approximately independent subproblems.
- Achieves a more compact algorithm state representation, enabling scalable model-based clustering.
- Demonstrates accurate and efficient clustering in knowledge base construction and in other settings where traditional SMC struggles.
Why it matters
This paper makes model-based clustering with Sequential Monte Carlo practical for large, complex datasets, such as text in knowledge bases. By representing the algorithm state more compactly, it reduces the memory footprint that limited previous SMC methods, enabling efficient and accurate online clustering.
Original Abstract
In online clustering problems, there is often a large amount of uncertainty over possible cluster assignments that cannot be resolved until more data are observed. This difficulty is compounded when clusters follow complex distributions, as is the case with text data. Sequential Monte Carlo (SMC) methods give a natural way of representing and updating this uncertainty over time, but have prohibitive memory requirements for large-scale problems. We propose a novel SMC algorithm that decomposes clustering problems into approximately independent subproblems, allowing a more compact representation of the algorithm state. Our approach is motivated by the knowledge base construction problem, and we show that our method is able to accurately and efficiently solve clustering problems in this setting and others where traditional SMC struggles.
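To make the memory issue concrete, here is a minimal sketch of a conventional SMC (particle filter) clusterer. It is not the paper's algorithm. The model is an assumption chosen for illustration: 1-D data, a Chinese restaurant process prior with concentration `alpha`, and Gaussian clusters with known variance and a zero-mean normal prior on each cluster mean. All function names and parameter values are hypothetical.

```python
import copy
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def predictive(x, count, total, obs_var=1.0, prior_var=25.0):
    """Posterior predictive density of x for a cluster holding `count` points
    that sum to `total` (assumed N(0, prior_var) prior on the cluster mean)."""
    post_var = 1.0 / (1.0 / prior_var + count / obs_var)
    post_mean = post_var * total / obs_var
    return normal_pdf(x, post_mean, post_var + obs_var)


def normalise(log_w):
    """Turn log-weights into normalised weights without overflow."""
    m = max(log_w)
    w = [math.exp(l - m) for l in log_w]
    z = sum(w)
    return [v / z for v in w]


def smc_cluster(data, n_particles=50, alpha=1.0, seed=0):
    rng = random.Random(seed)
    # Every particle stores its own full assignment history and cluster
    # statistics, so memory grows as O(n_particles * len(data)).
    particles = [{"assign": [], "stats": []} for _ in range(n_particles)]
    log_w = [0.0] * n_particles
    for x in data:
        for i, p in enumerate(particles):
            # Unnormalised posterior over joining each existing cluster
            # (weighted by its size) or opening a new one (weighted by alpha).
            scores = [c * predictive(x, c, s) for c, s in p["stats"]]
            scores.append(alpha * predictive(x, 0, 0.0))
            k = rng.choices(range(len(scores)), weights=scores)[0]
            if k == len(p["stats"]):
                p["stats"].append([0, 0.0])
            p["stats"][k][0] += 1
            p["stats"][k][1] += x
            p["assign"].append(k)
            # Incremental weight; the CRP normaliser is the same for all
            # particles and is dropped.
            log_w[i] += math.log(sum(scores))
        weights = normalise(log_w)
        ess = 1.0 / sum(w * w for w in weights)
        if ess < n_particles / 2:
            # Multinomial resampling when the effective sample size is low.
            idx = rng.choices(range(n_particles), weights=weights, k=n_particles)
            particles = [copy.deepcopy(particles[j]) for j in idx]
            log_w = [0.0] * n_particles
    return particles, normalise(log_w)
```

In this sketch each particle carries a complete copy of the clustering, which is the state that becomes prohibitively large at scale. The abstract's proposal is to shrink this state by splitting the problem into approximately independent subproblems. The sketch does not attempt that decomposition.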