ArXiv TLDR

How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences

arXiv:2605.05113

Mariia Seleznova

cs.LG

TLDR

This paper derives exact finite-width formulas for signal propagation in linear recurrences and identifies three depth-width scaling regimes, showing that infinite-width approximations break down once recurrent depth reaches $t \sim \sqrt{n}$.

Key contributions

  • Derives exact finite-width formulas for hidden state signal energies in linear recurrences.
  • Identifies three depth-width scaling regimes: subcritical, critical, and supercritical.
  • Pinpoints the precise recurrent depth scale ($t \sim \sqrt{n}$) at which infinite-width theory breaks down, with finite-width effects dominating for $t \gg \sqrt{n}$ (probed numerically in the sketch below).
  • Demonstrates that finite-width effects accumulate faster with depth in recurrent models than in feedforward ones.
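
To make the regimes concrete, here is a minimal simulation sketch — not the paper's code, and it assumes a specific setup the paper may parameterize differently: a recurrence $h_{t+1} = A h_t$ with one fixed matrix $A$ whose entries are i.i.d. complex Gaussian with variance $1/n$, for which the one-step calculation shown after the abstract suggests an average energy ratio of 1 at infinite width.

```python
# Sketch under an assumed setup; the paper's exact recurrence and
# parameterization may differ.
import numpy as np

def mean_energy_ratio(n, depth, trials=200, seed=0):
    """Average ||h_t||^2 / ||h_0||^2 over independent draws of (A, h_0)."""
    rng = np.random.default_rng(seed)
    ratios = []
    for _ in range(trials):
        # Complex Gaussian entries with E|A_ij|^2 = 1/n (Glorot-style scale):
        # real and imaginary parts are each N(0, 1/(2n)).
        A = (rng.standard_normal((n, n))
             + 1j * rng.standard_normal((n, n))) / np.sqrt(2 * n)
        h = rng.standard_normal(n) + 1j * rng.standard_normal(n)
        e0 = np.vdot(h, h).real
        for _ in range(depth):
            h = A @ h  # shared matrix across steps: a recurrence, not a product
        ratios.append(np.vdot(h, h).real / e0)
    return float(np.mean(ratios))

n = 256  # width; the critical depth scale from the abstract is sqrt(n) = 16
for t in [4, 16, 64, 256]:  # t = o(sqrt n), t ~ sqrt n, t >> sqrt n
    print(f"t={t:4d}  mean energy ratio: {mean_energy_ratio(n, t):.3f}")
```

Deviations of the printed ratio from 1 at depths near and beyond $\sqrt{n}$ would illustrate the critical and supercritical regimes described above.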

Why it matters

This paper fills a gap in signal propagation theory for recurrent models, showing when infinite-width assumptions fail for long sequences. It identifies precisely when standard initialization schemes such as Glorot become unstable and highlights fundamental differences from feedforward networks.

Original Abstract

We study signal propagation in linear recurrent models at finite width. While existing signal propagation theory relies predominantly on the infinite-width limit, it remains unclear for how long that approximation remains accurate when recurrent depth $t$ grows jointly with width $n$. This question is especially relevant for modern recurrent sequence models, whose natural operating regime involves long input sequences, i.e., large $t$. We derive exact finite-width formulas for the hidden state signal energies in linear recurrences under complex Gaussian initialization. Using these formulas, we identify the joint depth-width scaling regimes that govern signal propagation: (i) a subcritical regime $t=o(\sqrt n)$, in which the infinite-width approximation remains valid; (ii) a critical regime $t\sim c\sqrt n$, in which non-negligible deviations from infinite-width predictions appear and a nontrivial joint scaling limit emerges; and (iii) a supercritical regime $t\gg \sqrt n$, in which finite-width effects dominate. Thus, our results pinpoint the precise recurrent depth scale at which infinite-width theory breaks down in long-range linear recurrences. In turn, this shows when standard initialization schemes, such as Glorot, become unstable. More broadly, our results demonstrate that finite-width effects accumulate more rapidly with depth in recurrent models than in feedforward ones, leading to qualitatively different signal propagation behavior.
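
For intuition on the infinite-width prediction at stake, a one-step energy calculation is easy to verify, assuming entries $A_{ij} \sim \mathcal{CN}(0, 1/n)$ and an input $h$ independent of $A$ (our notation, not necessarily the paper's):

```latex
\mathbb{E}\,\lVert A h \rVert^2
  = \sum_{i=1}^{n} \mathbb{E}\Bigl|\sum_{j=1}^{n} A_{ij} h_j\Bigr|^2
  = \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{1}{n}\,\lvert h_j \rvert^2
  = \lVert h \rVert^2 .
```

One step therefore preserves expected energy exactly. Iterating with the same $A$ breaks the independence assumption, and the resulting finite-width corrections are what accumulate on the $t \sim \sqrt{n}$ scale the abstract describes.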
