Sharp feature-learning transitions and Bayes-optimal neural scaling laws in extensive-width networks
Minh-Toan Nguyen, Jean Barbier
TLDR
A Bayes-optimal analysis of extensive-width one-hidden-layer networks reveals sharp, sequential feature-learning phase transitions and introduces an 'effective width' that unifies two distinct neural scaling-law regimes.
Key contributions
- Characterizes Bayes-optimal generalization error and feature overlaps in extensive-width networks.
- Reveals sharp phase transitions where teacher features are sequentially learned as data increases.
- Introduces 'effective width' (k_c) unifying two distinct neural scaling law regimes.
- Shows the Bayes-optimal error scales as Θ(k_c d/n), a law that Adam-trained students of width near k_c match up to a small algorithmic gap (see the formulas after this list).
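For reference, the two scaling regimes quoted in the abstract below, and their collapse onto a single law, can be written out as follows. Constants are suppressed, and associating each regime with $k_c < k$ versus $k_c = k$ is a reading of the abstract rather than a quoted formula:

```latex
% Requires amsmath. Constants suppressed inside \Theta(\cdot); the regime
% conditions in terms of k_c are an interpretation, not quoted from the paper.
\[
\varepsilon^{\rm BO} \;=\;
\begin{cases}
\Theta\!\left(n^{1/(2\beta)-1}\right), & \text{feature-learning regime } (k_c < k),\\[2pt]
\Theta\!\left(n^{-1}\right),           & \text{refinement regime } (k_c = k),
\end{cases}
\qquad\text{both collapsing to}\quad
\varepsilon^{\rm BO} \;=\; \Theta\!\left(\frac{k_c\, d}{n}\right).
\]
```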
Why it matters
This work provides a fundamental understanding of feature learning and generalization in large neural networks. By introducing the 'effective width', it unifies two distinct scaling laws and clarifies both how the generalization error scales with data and how large a student model needs to be, advancing the theory of deep learning.
Original Abstract
We study the information-theoretic limits of learning a one-hidden-layer teacher network with hierarchical features from noisy queries, in the context of knowledge transfer to a smaller student model. We work in the high-dimensional regime where the teacher width $k$ scales linearly with the input dimension $d$ -- a setting that captures large-but-finite-width networks and has only recently become analytically tractable. Using a heuristic leave-one-out decoupling argument, validated numerically throughout, we derive asymptotically sharp characterizations of the Bayes-optimal generalization error and individual feature overlaps via a system of closed fixed-point equations. These equations reveal that feature learnability is governed by a sequence of sharp phase transitions: as data grows, teacher features become recoverable sequentially, each through a discontinuous jump in overlap. This sequential acquisition underlies a precise notion of \textit{effective width} $k_c$ -- the number of learnable features at a given data budget $n$ -- which unifies two distinct scaling regimes: a feature-learning regime in which the Bayes-optimal generalization error $\varepsilon^{\rm BO}$ scales as $n^{1/(2\beta)-1}$, and a refinement regime in which it scales as $n^{-1}$, where $\beta>1/2$ is the exponent of the power-law feature hierarchy. Both laws collapse to the single relation $\varepsilon^{\rm BO}=\Theta(k_c d/n)$. We further show empirically that a student trained with \textsc{Adam} near the effective width $k_c$ achieves these optimal scaling laws (up to a small algorithmic gap), and provide an information-theoretic account of the associated scaling in model size.
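As a rough numerical illustration of the collapse, here is a minimal sketch assuming an effective width of the form $k_c(n) \propto n^{1/(2\beta)}$ capped at the true width $k$. This functional form is consistent with the exponents quoted in the abstract but is an assumption, not the paper's fixed-point characterization of $k_c$:

```python
# Illustrative sketch (not the paper's code): how the two scaling regimes
# collapse onto eps_BO = Theta(k_c * d / n). The form k_c(n) ~ n^{1/(2*beta)},
# capped at the true width k, is an assumption consistent with the abstract's
# exponents, not an exact formula from the paper.
import numpy as np

beta = 1.0   # power-law exponent of the feature hierarchy (beta > 1/2)
d = 1000     # input dimension
k = 500      # teacher width, k = Theta(d) in the extensive-width regime
c = 1.0      # unspecified constant absorbed into Theta(.)

def effective_width(n):
    """Assumed effective width: number of learnable features at data budget n."""
    return np.minimum(k, c * n ** (1.0 / (2.0 * beta)))

for n in [1e4, 1e5, 1e6, 1e7, 1e8]:
    k_c = effective_width(n)
    eps = k_c * d / n  # eps_BO = Theta(k_c d / n), constants ignored
    regime = "feature-learning" if k_c < k else "refinement"
    print(f"n={n:.0e}  k_c={k_c:8.1f}  eps~{eps:.3e}  ({regime})")
```

With beta = 1, the printed error decays as n^{-1/2} while k_c < k and as n^{-1} once k_c saturates at k, reproducing the two regimes described in the abstract.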