Subcritical Signal Propagation at Initialization in Normalization-Free Transformers
TLDR
This paper analyzes signal propagation at initialization in transformers through the averaged partial Jacobian norm (APJN), explaining why normalization-free models such as DyT and Derf are sensitive to initialization and optimization choices.
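The APJN measures how strongly gradients are amplified or attenuated between layers at initialization. Below is a minimal sketch (assumed, not the authors' code) of estimating it empirically for a small stack of pre-LN transformer blocks; the block definition, toy sizes, and the normalization by total output dimension are illustrative choices.

```python
# Minimal sketch (assumed, not the authors' code) of empirically estimating the
# averaged partial Jacobian norm (APJN) J^{l0,l} for a toy stack of pre-LN
# transformer blocks at initialization.
import torch
import torch.nn as nn

WIDTH, DEPTH, SEQ = 32, 8, 4  # assumed toy model size


class Block(nn.Module):
    """Simplified pre-LN transformer block: LN -> self-attention, LN -> MLP, with residuals."""

    def __init__(self, d: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.ln2 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.ln2(x))


blocks = nn.ModuleList(Block(WIDTH) for _ in range(DEPTH))


def apjn(l0: int, l: int, n_samples: int = 4) -> float:
    """Average squared Frobenius norm of d h^l / d h^{l0}, divided by output size.

    Averages over a few random inputs at a single initialization; a fuller
    estimate would also average over re-initializations of the weights.
    """
    vals = []
    for _ in range(n_samples):
        h0 = torch.randn(1, SEQ, WIDTH)

        def propagate(h):
            for blk in blocks[l0:l]:
                h = blk(h)
            return h

        J = torch.autograd.functional.jacobian(propagate, h0)
        vals.append(J.pow(2).sum().item() / (SEQ * WIDTH))
    return sum(vals) / len(vals)


# Depth profile of the APJN from the input to each layer.
print([round(apjn(0, l), 3) for l in range(1, DEPTH + 1)])
```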
Key contributions
- Extends APJN analysis to transformers with bidirectional attention and permutation-symmetric inputs.
- Derives recurrence relations for activation statistics and APJNs across layers, predicting how attention modifies the APJN's large-depth behavior.
- Shows that normalization-free transformers with tanh-like nonlinearities exhibit subcritical signal propagation, i.e. stretched-exponential APJN growth (see the fitting sketch after this list).
- Explains why DyT and Derf transformers are sensitive to initialization and optimization choices.
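To tell critical from subcritical propagation in practice, one can fit a measured APJN depth profile to a power law versus a stretched exponential, following the criticality picture above. The sketch below uses synthetic placeholder data and the assumed forms J(l) ≈ a·l^ζ and J(l) ≈ a·exp(c·l^η); it is an illustration, not the paper's fitting procedure.

```python
# Sketch (not the paper's procedure) of distinguishing critical from subcritical
# propagation by fitting an APJN depth profile to a power law J(l) ~ a * l**zeta
# versus a stretched exponential J(l) ~ a * exp(c * l**eta). The `apjn_values`
# below are synthetic placeholders standing in for measurements such as those
# from the estimator sketched above.
import numpy as np
from scipy.optimize import curve_fit

depths = np.arange(1, 33, dtype=float)
apjn_values = np.exp(0.8 * depths**0.5)  # placeholder data with stretched-exponential shape
log_j = np.log(apjn_values)

# Power law: log J = log a + zeta * log l is linear in log-log coordinates.
zeta, log_a = np.polyfit(np.log(depths), log_j, 1)
res_pow = np.mean((log_a + zeta * np.log(depths) - log_j) ** 2)

# Stretched exponential: log J = log a + c * l**eta.
def log_stretched(l, log_a, c, eta):
    return log_a + c * l**eta

popt, _ = curve_fit(log_stretched, depths, log_j, p0=[0.0, 1.0, 0.5])
res_str = np.mean((log_stretched(depths, *popt) - log_j) ** 2)

# A clearly better stretched-exponential fit signals subcritical propagation;
# a better power-law fit signals the critical behaviour of pre-LN transformers.
print(f"power-law residual: {res_pow:.3e}, stretched-exponential residual: {res_str:.3e}")
```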
Why it matters
Understanding signal propagation at initialization is crucial for stable training of deep networks. This work provides a theoretical framework for transformers, particularly normalization-free variants, explaining the training instabilities observed in architectures like DyT and Derf, and it can guide the design of more robust models.
Original Abstract
We study signal propagation at initialization in transformers through the averaged partial Jacobian norm (APJN), a measure of gradient amplification across layers. We extend APJN analysis to transformers with bidirectional attention and permutation-symmetric input token configurations by deriving recurrence relations for activation statistics and APJNs across layers. Our theory predicts how attention modifies the asymptotic behavior of the APJN at large depth and matches APJNs measured in deep vision transformers. The criticality picture known from residual networks carries over to transformers: the pre-LayerNorm architecture exhibits power-law APJN growth, whereas transformers with LayerNorm replaced by elementwise $\tanh$-like nonlinearities have stretched-exponential APJN growth, indicating that the latter are subcritical. Applied to Dynamic Tanh (DyT) and Dynamic erf (Derf) transformers, the theory explains why these architectures can be more sensitive to initialization and optimization choices and require careful tuning for stable training.
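For context, the tanh-like layers mentioned in the abstract replace LayerNorm with a learnable elementwise squashing function. A minimal sketch of DyT and a presumed analogous Derf follows, using the commonly cited parameterization γ · tanh(αx) + β with a learnable scalar α; the initialization values and the Derf form are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of the elementwise tanh-like layers from the abstract: Dynamic
# Tanh (DyT) and, by analogy, Dynamic erf (Derf) as drop-in replacements for
# LayerNorm. Initialization values are illustrative assumptions.
import torch
import torch.nn as nn


class DyT(nn.Module):
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))  # shared scalar slope
        self.gamma = nn.Parameter(torch.ones(dim))           # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))           # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta


class Derf(nn.Module):
    """Same structure as DyT, with erf in place of tanh."""

    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.erf(self.alpha * x) + self.beta


# Swapping nn.LayerNorm(dim) for DyT(dim) (or Derf(dim)) in a transformer block
# yields the normalization-free variant whose APJN the theory predicts to grow
# as a stretched exponential (subcritical) rather than a power law.
```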