annbatch unlocks terabyte-scale training of biological data in anndata

April 2, 2026

arXiv:2604.01949

Ilan Gold, Felix Fischer, Lucas Arnoldt, F. Alexander Wolf, Fabian J. Theis

cs.LG, q-bio.GN

TLDR

Annbatch is an out-of-core mini-batch loader for anndata that makes training on terabyte-scale biological datasets practical, raising data loading throughput by up to an order of magnitude.

Key contributions

Introduces annbatch, an anndata-native mini-batch loader for out-of-core training on disk-backed datasets.
Increases data loading throughput by up to an order of magnitude across single-cell transcriptomics, microscopy and whole-genome sequencing benchmarks.
Shortens machine learning model training from days to hours.
Ensures full compatibility with the scverse ecosystem, preserving standard biological data formats.

Why it matters

When datasets exceed system memory, data access rather than model computation becomes the main bottleneck in training ML models. Annbatch addresses this with out-of-core training directly on disk-backed datasets, so researchers can use larger and more diverse data without abandoning standard biological data formats.

Original Abstract

The scale of biological datasets now routinely exceeds system memory, making data access rather than model computation the primary bottleneck in training machine-learning models. This bottleneck is particularly acute in biology, where widely used community data formats must support heterogeneous metadata, sparse and dense assays, and downstream analysis within established computational ecosystems. Here we present annbatch, a mini-batch loader native to anndata that enables out-of-core training directly on disk-backed datasets. Across single-cell transcriptomics, microscopy and whole-genome sequencing benchmarks, annbatch increases loading throughput by up to an order of magnitude and shortens training from days to hours, while remaining fully compatible with the scverse ecosystem. Annbatch establishes a practical data-loading infrastructure for scalable biological AI, allowing increasingly large and diverse datasets to be used without abandoning standard biological data formats. Github: https://github.com/scverse/annbatch
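To make the idea of out-of-core mini-batch loading concrete, here is a minimal sketch of one common approach: read contiguous chunks of a disk-backed matrix in random order, shuffle rows within the chunks held in memory, and yield batches. This is a generic illustration built on `numpy.memmap`, not annbatch's API or its actual loading strategy; the function names `iter_minibatches` and `demo`, the file layout, and all parameters are assumptions made for this example.

```python
import os
import tempfile

import numpy as np


def iter_minibatches(path, n_rows, n_cols, batch_size=4, chunk_size=8,
                     chunks_per_load=2, seed=0, dtype=np.float32):
    """Yield shuffled mini-batches from a disk-backed matrix (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # memmap maps the file lazily; rows are read from disk only when sliced.
    data = np.memmap(path, dtype=dtype, mode="r", shape=(n_rows, n_cols))
    # Split the rows into contiguous chunks and visit the chunks in random order.
    starts = np.arange(0, n_rows, chunk_size)
    rng.shuffle(starts)
    for i in range(0, len(starts), chunks_per_load):
        group = starts[i:i + chunks_per_load]
        # Each chunk is one contiguous read; only this group is held in memory.
        block = np.concatenate(
            [np.asarray(data[s:min(s + chunk_size, n_rows)]) for s in group]
        )
        # Shuffle rows within the in-memory block, then cut it into batches.
        block = block[rng.permutation(len(block))]
        for j in range(0, len(block), batch_size):
            yield block[j:j + batch_size]


def demo():
    """Write a toy matrix to disk and stream it back as mini-batches."""
    n_rows, n_cols = 50, 3
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "matrix.dat")
        # Toy stand-in for a dataset that would not fit in memory.
        out = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_rows, n_cols))
        out[:] = np.arange(n_rows * n_cols, dtype=np.float32).reshape(n_rows, n_cols)
        out.flush()
        del out
        batches = list(iter_minibatches(path, n_rows, n_cols))
    return batches
```

Reading whole chunks keeps disk access mostly sequential, while shuffling the chunk order and the rows within each loaded group gives approximately random batches. Every row is still visited exactly once per pass.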
