Cluster-First Labelling: An Automated Pipeline for Segmentation and Morphological Clustering in Histology Whole Slide Images
Muhammad Haseeb Ahmad, Sharmila Rajendran, Damion Young, Jon Mason
TLDR
A cloud-native pipeline automates histology WSI segmentation and morphological clustering, so annotators label representative clusters rather than individual objects.
Key contributions
- Automates WSI segmentation and morphological clustering using a novel cluster-first paradigm.
- Integrates Cellpose-SAM, ResNet-50 embeddings, UMAP, and DBSCAN for robust object processing.
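- Achieves a weighted cluster-label alignment accuracy of 96.8% on 3,696 tissue components across 13 tissue types from three species (human, rat, rabbit).
- Reduces human annotation effort by orders of magnitude, according to the authors, by having annotators label representative clusters rather than individual objects.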
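The cluster-first idea in the last bullet can be illustrated with a minimal sketch: once objects carry cluster ids, one annotator decision per cluster is copied to every member. The function names, the example data, and the choice to leave noise points unlabelled are assumptions made for illustration. The summary does not say how the authors' pipeline treats DBSCAN noise; the `-1` noise id follows scikit-learn's DBSCAN convention.

```python
from collections import Counter

NOISE = -1  # scikit-learn's DBSCAN marks unclustered points with -1


def propagate_labels(cluster_ids, cluster_names):
    """Give every object the name the annotator assigned to its cluster.

    Noise points and objects in clusters not yet named get None
    (an assumption of this sketch, not a documented pipeline behaviour).
    """
    return [
        None if cid == NOISE else cluster_names.get(cid)
        for cid in cluster_ids
    ]


def annotation_effort(cluster_ids):
    """Return (number of clusters to label, number of objects they cover)."""
    sizes = Counter(cid for cid in cluster_ids if cid != NOISE)
    return len(sizes), sum(sizes.values())


# Hypothetical clustering output for eight segmented objects.
cluster_ids = [0, 0, 1, 0, 1, -1, 1, 0]
# Hypothetical annotator decisions, one per cluster.
cluster_names = {0: "nucleus", 1: "cell"}

labels = propagate_labels(cluster_ids, cluster_names)
n_decisions, n_covered = annotation_effort(cluster_ids)
```

In this toy case two annotator decisions cover seven of the eight objects. The saving grows with cluster size, which is the source of the reduction in effort the authors describe.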
Why it matters
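Annotating WSIs by hand is labour-intensive because a single slide may contain tens of thousands of structures. By automating segmentation and clustering, the pipeline lets annotators label representative clusters instead of individual objects, which the authors report cuts annotation effort by orders of magnitude. The reported 96.8% weighted alignment with independent human labels, together with the open-source release of the pipeline, labelling web application, and evaluation code, makes the approach straightforward for other pathology researchers to test and reuse.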
Original Abstract
Labelling tissue components in histology whole slide images (WSIs) is prohibitively labour-intensive: a single slide may contain tens of thousands of structures (cells, nuclei, and other morphologically distinct objects), each requiring manual boundary delineation and classification. We present a cloud-native, end-to-end pipeline that automates this process through a cluster-first paradigm. Our system tiles WSIs, filters out tiles deemed unlikely to contain valuable information, segments tissue components with Cellpose-SAM (including cells, nuclei, and other morphologically similar structures), extracts neural embeddings via a pretrained ResNet-50, reduces dimensionality with UMAP, and groups morphologically similar objects using DBSCAN clustering. Under this paradigm, a human annotator labels representative clusters rather than individual objects, reducing annotation effort by orders of magnitude. We evaluate the pipeline on 3,696 tissue components across 13 diverse tissue types from three species (human, rat, rabbit), measuring how well unsupervised clusters align with independent human labels via per-tile Hungarian-algorithm matching. Our system achieves a weighted cluster-label alignment accuracy of 96.8%, with 7 of 13 tissue types reaching perfect agreement. The pipeline, a companion labelling web application, and all evaluation code are released as open-source software.
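The evaluation step in the abstract, matching clusters to human labels per tile and then aggregating, can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: it finds the optimal one-to-one cluster-to-label assignment by brute force over permutations (the same optimum the Hungarian algorithm would return, but only practical for a handful of clusters per tile), and it assumes "weighted" means weighting each tile by its object count, which the abstract does not define. Function names and example data are hypothetical.

```python
from collections import Counter
from itertools import permutations


def tile_matches(cluster_ids, human_labels):
    """Most objects that can agree under a one-to-one cluster-to-label map."""
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(human_labels))
    # overlap[(cluster, label)] = number of objects with that pair.
    overlap = Counter(zip(cluster_ids, human_labels))
    # Pad the shorter side with None so surplus clusters or labels
    # can stay unmatched (a missing Counter key counts as 0).
    size = max(len(clusters), len(classes))
    clusters += [None] * (size - len(clusters))
    classes += [None] * (size - len(classes))
    # Brute-force stand-in for the Hungarian algorithm.
    return max(
        sum(overlap[(c, k)] for c, k in zip(clusters, perm))
        for perm in permutations(classes)
    )


def weighted_accuracy(tiles):
    """Matched objects over all objects, so larger tiles weigh more.

    Assumed reading of "weighted"; cluster ids are treated as opaque,
    so a noise id such as -1 is matched like any other cluster.
    """
    matched = sum(tile_matches(c, h) for c, h in tiles)
    total = sum(len(c) for c, _ in tiles)
    return matched / total


# Two hypothetical tiles: (cluster ids, independent human labels).
tile_a = ([0, 0, 1, 1, 1], ["nucleus", "nucleus", "cell", "cell", "nucleus"])
tile_b = ([0, 0, 0], ["fibre", "fibre", "fibre"])

accuracy = weighted_accuracy([tile_a, tile_b])
```

Matching per tile means cluster ids only need to be consistent within a tile, since the assignment is recomputed for each one. For tiles with many clusters, an implementation of the Hungarian algorithm such as SciPy's `linear_sum_assignment` would replace the permutation search.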