How Much Data is Enough? The Zeta Law of Discoverability in Biomedical Data, featuring the enigmatic Riemann zeta function
TLDR
This paper introduces a zeta-law scaling framework to predict when more biomedical data, better representations, or new modalities will accelerate scientific discovery.
Key contributions
- Proposes a scaling-law framework for cross-modal discoverability built on the spectral structure of data covariance operators and task-aligned signal projections.
- Shows that performance metrics such as AUC follow a "zeta-like" scaling law in which the Riemann zeta function arises naturally (see the sketch after this list).
- Explains how representation learning improves sample efficiency by concentrating useful signal into earlier, stable spectral modes, effectively steepening spectral decay.
- Predicts cross-over regimes in which simpler models perform best at small sample sizes, while higher-capacity or multimodal encoders win once sufficient data stabilizes their additional degrees of freedom.
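The zeta-like law can be made concrete with a small numerical sketch. The snippet below is a toy illustration under assumed forms, not the paper's derivation: accumulated signal-to-noise energy is modeled as the partial sum of k^(-s) over the first K stable modes, which approaches the Riemann zeta value ζ(s) as K grows, and the rule mapping sample size n to K stable modes (K ≈ n^γ) is hypothetical.

```python
# Toy illustration of the zeta-like law, not the paper's derivation:
# accumulated signal-to-noise energy across the first K stable spectral modes
# is modeled as the partial sum of k**(-s), whose limit as K -> infinity is
# the Riemann zeta value zeta(s). The rule K ~ n**gamma is an assumption.
import numpy as np
from scipy.special import zeta

def cumulative_signal(K, s):
    """Partial zeta sum sum_{k=1}^K k^{-s}, a proxy for accumulated SNR energy."""
    k = np.arange(1, K + 1)
    return np.sum(k ** (-s))

def stable_modes(n, gamma=0.5):
    """Hypothetical rule: n samples stabilize roughly n**gamma spectral modes."""
    return max(1, int(n ** gamma))

for s in (1.5, 2.5):  # a steeper exponent mimics a better-concentrated representation
    for n in (10, 100, 10_000, 1_000_000):
        K = stable_modes(n)
        frac = cumulative_signal(K, s) / zeta(s)  # fraction of asymptotic signal
        print(f"s={s}  n={n:>9,}  recovered {frac:.3f} of asymptotic signal")
```

At any fixed sample size, the steeper spectrum (s = 2.5) recovers a much larger fraction of its asymptotic signal than the shallower one (s = 1.5), which is the mechanism by which better representations are said to shift scaling curves.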
Why it matters
This paper provides a theoretical framework for data scaling in biomedical research, moving beyond purely empirical observations. It offers a principled way to predict when more data, better representations, or new modalities will accelerate discovery, which is crucial for allocating resources in large-scale, AI-driven science.
Original Abstract
How much data is enough to make a scientific discovery? As biomedical datasets scale to millions of samples and AI models grow in capacity, progress increasingly depends on predicting when additional data will substantially improve performance. In practice, model development often relies on empirical scaling curves measured across architectures, modalities, and dataset sizes, with limited theoretical guidance on when performance should improve, saturate, or exhibit cross-over behavior. We propose a scaling-law framework for cross-modal discoverability based on spectral structure of data covariance operators, task-aligned signal projections, and learned representations. Many performance metrics, including AUC, can be expressed in terms of cumulative signal-to-noise energy accumulated across identifiable spectral modes of an encoder and cross-modal operator. Under mild assumptions, this accumulation follows a zeta-like scaling law governed by power-law decay of covariance spectra and aligned signal energy, leading naturally to the appearance of the Riemann zeta function. Representation learning methods such as sparse models, low-rank embeddings, and multimodal contrastive objectives improve sample efficiency by concentrating useful signal into earlier stable modes, effectively steepening spectral decay and shifting scaling curves. The framework predicts cross-over regimes in which simpler models perform best at small sample sizes, while higher-capacity or multimodal encoders outperform them once sufficient data stabilizes additional degrees of freedom. Applications include multimodal disease classification, imaging genetics, functional MRI, and topological data analysis. The resulting zeta law provides a principled way to anticipate when scaling data, improving representations, or adding modalities is most likely to accelerate discovery.
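The cross-over prediction in the abstract can be illustrated with the same toy setup. In the sketch below, every exponent, mode budget, and signal total is a hypothetical choice made for illustration: each model spreads its total task-aligned signal across spectral modes with normalized power-law weights, n samples are assumed to stabilize roughly n^γ modes, and only stabilized modes contribute to performance.

```python
# Toy cross-over sketch; every exponent, mode budget, and signal total below
# is hypothetical. Each model spreads its total task-aligned signal over
# spectral modes with normalized power-law weights k**(-s); n samples are
# assumed to stabilize roughly n**gamma modes, and only those contribute.
import numpy as np

def performance(n, s, gamma, n_modes, total_signal):
    """Signal recovered at sample size n under the assumed stabilization rule."""
    k = np.arange(1, n_modes + 1)
    weights = k ** (-s) / np.sum(k ** (-s))    # per-mode share of total signal
    K = min(n_modes, max(1, int(n ** gamma)))  # modes stabilized by n samples
    return total_signal * np.sum(weights[:K])

simple = dict(s=2.0, gamma=0.60, n_modes=20, total_signal=1.0)        # steep, small
multimodal = dict(s=1.2, gamma=0.45, n_modes=5000, total_signal=1.5)  # shallow, large

for n in (30, 300, 3_000, 300_000):
    p_s, p_m = performance(n, **simple), performance(n, **multimodal)
    leader = "simple" if p_s > p_m else "multimodal"
    print(f"n={n:>7,}: simple={p_s:.2f}  multimodal={p_m:.2f}  ({leader} ahead)")
```

With these assumed numbers, the simple model leads at small sample sizes and the multimodal model overtakes it somewhere between a few hundred and a few thousand samples; this reproduces the qualitative cross-over behavior the abstract describes, not a quantitative prediction.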