
Scalable Model-Based Clustering with Sequential Monte Carlo

2604.14810

Connie Trojan, Pavel Myshkov, Paul Fearnhead, James Hensman, Tom Minka + 1 more

stat.ML, cs.LG, stat.CO

TLDR

This paper introduces a Sequential Monte Carlo (SMC) algorithm that scales model-based clustering by decomposing each problem into approximately independent subproblems, which allows a more compact algorithm state.

Key contributions

  • Addresses the prohibitive memory requirements of traditional Sequential Monte Carlo in large-scale online clustering.
  • Introduces a novel SMC algorithm that decomposes clustering problems into approximately independent subproblems.
  • Uses that decomposition to represent the algorithm state more compactly, which is what makes the method scale.
  • Shows accurate and efficient clustering for knowledge base construction and other settings where traditional SMC struggles.

Why it matters

This paper makes model-based clustering with Sequential Monte Carlo practical for large datasets whose clusters follow complex distributions, such as the text data found in knowledge bases. It targets the memory requirements that limit previous SMC methods while retaining SMC's ability to represent and update uncertainty over cluster assignments as data arrive.

Original Abstract

In online clustering problems, there is often a large amount of uncertainty over possible cluster assignments that cannot be resolved until more data are observed. This difficulty is compounded when clusters follow complex distributions, as is the case with text data. Sequential Monte Carlo (SMC) methods give a natural way of representing and updating this uncertainty over time, but have prohibitive memory requirements for large-scale problems. We propose a novel SMC algorithm that decomposes clustering problems into approximately independent subproblems, allowing a more compact representation of the algorithm state. Our approach is motivated by the knowledge base construction problem, and we show that our method is able to accurately and efficiently solve clustering problems in this setting and others where traditional SMC struggles.
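
To make the memory issue in the abstract concrete, here is a minimal sketch of a conventional SMC (particle filter) baseline for online clustering. It is not the paper's algorithm. Everything in it is an illustrative assumption: one-dimensional Gaussian clusters with known variance, a Chinese-restaurant-process prior with concentration alpha, and the hypothetical names smc_cluster and predictive. The point is only that every particle carries its own complete assignment history, so storage grows as particles times observations.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def predictive(x, count, total, mu0=0.0, tau2=25.0, sigma2=1.0):
    """Posterior predictive density of x for a cluster holding `count` points
    that sum to `total` (Normal prior on the mean, known variance sigma2)."""
    prec = 1.0 / tau2 + count / sigma2
    post_mean = (mu0 / tau2 + total / sigma2) / prec
    return normal_pdf(x, post_mean, 1.0 / prec + sigma2)


def smc_cluster(data, n_particles=200, alpha=1.0, seed=0):
    """Generic particle filter over cluster assignments (assumed baseline)."""
    rng = random.Random(seed)
    # Each particle keeps a full assignment list plus per-cluster (count, sum).
    particles = [{"z": [], "stats": []} for _ in range(n_particles)]
    log_w = [0.0] * n_particles
    w = [1.0 / n_particles] * n_particles
    for t, x in enumerate(data):
        for i, p in enumerate(particles):
            # Unnormalised probabilities: each existing cluster, then a new one.
            probs = [c / (t + alpha) * predictive(x, c, s) for c, s in p["stats"]]
            probs.append(alpha / (t + alpha) * predictive(x, 0, 0.0))
            k = rng.choices(range(len(probs)), weights=probs)[0]
            if k == len(p["stats"]):
                p["stats"].append((1, x))
            else:
                c, s = p["stats"][k]
                p["stats"][k] = (c + 1, s + x)
            p["z"].append(k)
            # Sampling from the normalised probs means the incremental
            # weight is their sum.
            log_w[i] += math.log(sum(probs))
        # Normalise the weights in a numerically stable way.
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]
        norm = sum(w)
        w = [v / norm for v in w]
        log_w = [lw - m for lw in log_w]
        # Resample when the effective sample size drops below half.
        ess = 1.0 / sum(v * v for v in w)
        if ess < n_particles / 2:
            idx = rng.choices(range(n_particles), weights=w, k=n_particles)
            particles = [
                {"z": list(particles[j]["z"]), "stats": list(particles[j]["stats"])}
                for j in idx
            ]
            log_w = [0.0] * n_particles
            w = [1.0 / n_particles] * n_particles
    return particles, w


data = [-5.1, -4.9, 5.0, 5.2, -5.0, 4.8]
particles, weights = smc_cluster(data, n_particles=50)
# Total stored labels: one per observation per particle.
stored_labels = sum(len(p["z"]) for p in particles)
```

With 50 particles and 6 observations the filter already stores 300 labels, and that count grows linearly in both quantities. This is the kind of state the paper aims to represent more compactly; how its decomposition achieves that is not shown here.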
