Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training
Namrata Shivagunde, Vijeta Deshpande, Sherin Muckatira, Anna Rumshisky
TLDR
Low-rank pre-training methods yield solutions that are geometrically distinct from full-rank models and from each other, even at similar perplexity, which calls for deeper evaluation metrics.
Key contributions
- Evaluated five low-rank pre-training methods against full-rank training using 16 geometric and spectral metrics.
- Showed that low-rank methods produce solutions geometrically distinct from full-rank training and from each other, despite similar perplexity.
- Found that full-rank models settle into sharper loss basins along random directions, while low-rank methods are sharper along the top-1 PCA direction (see the sketch after this list).
- Demonstrated that perplexity is a poor proxy for solution quality; geometric and spectral metrics better predict downstream performance.
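The sharpness finding above comes from 1-D loss-slice probes: evaluating the loss at `theta + alpha * d` for a sweep of scalars `alpha`, where `d` is either a random direction or the top-1 PCA direction of the checkpoint trajectory. Below is a minimal PyTorch sketch of such a probe, assuming a filter-normalized random direction in the style of Li et al. (2018); `loss_fn` and `batch` are hypothetical placeholders, and the paper's exact protocol may differ.

```python
import torch

def random_direction(model):
    """Random probe direction, rescaled per parameter to match the weight
    norm (a filter-normalization-style choice assumed here)."""
    dirs = []
    for p in model.parameters():
        d = torch.randn_like(p)
        d.mul_(p.norm() / (d.norm() + 1e-12))
        dirs.append(d)
    return dirs

@torch.no_grad()
def loss_slice(model, direction, loss_fn, batch, alphas):
    """Evaluate the loss at theta + alpha * d for each alpha in `alphas`.

    `loss_fn(model, batch)` is a placeholder expected to return a scalar
    loss tensor; a sharper basin shows a steeper rise away from alpha = 0.
    """
    base = [p.detach().clone() for p in model.parameters()]
    losses = []
    for alpha in alphas:
        for p, p0, d in zip(model.parameters(), base, direction):
            p.copy_(p0 + alpha * d)
        losses.append(loss_fn(model, batch).item())
    for p, p0 in zip(model.parameters(), base):  # restore original weights
        p.copy_(p0)
    return losses

# Example sweep along a random direction:
# alphas = torch.linspace(-1.0, 1.0, 21).tolist()
# curve = loss_slice(model, random_direction(model), loss_fn, batch, alphas)
```

For a top-1 PCA slice, `direction` would instead be the leading principal component of the flattened differences between saved training checkpoints (an assumption about the standard recipe, not a detail confirmed by this summary).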
Why it matters
This paper moves beyond perplexity to provide a deeper understanding of low-rank pre-training methods. It reveals that these methods converge to distinct loss landscape regions and internal representations, even when perplexity is similar. This work highlights the need for more comprehensive evaluation metrics, guiding future research and development of memory-efficient LLMs.
Original Abstract
Pre-training large language models is dominated by the memory cost of storing full-rank weights, gradients, and optimizer states. Low-rank pre-training has emerged to address this, and the space of methods has grown rapidly. A central question remains open: do low-rank methods produce models that generalize comparably to full-rank training, or does the rank constraint fundamentally alter the solutions reached? Existing comparisons rely almost entirely on validation perplexity from single-seed runs, often carried forward from prior literature. Yet perplexity is a poor proxy for solution quality; two methods can match on perplexity while converging to different loss landscape regions and internal representations. We close this gap by characterizing the solutions found by five low-rank pre-training methods, GaLore and Fira (memory-efficient optimizers), CoLA and SLTrain (architecture reparameterizations), and ReLoRA (adapter-style updates with periodic resets), against full-rank training at three model scales (60M, 130M, 350M). We evaluate each along 16 metrics across four dimensions: 1-D loss landscape along random/top-K PCA directions, 1-D interpolation between checkpoints, spectral structure of the weights and learned updates, and activation similarity to full-rank training. We show that low-rank methods are not equivalent to full-rank training, nor to one another, even when validation perplexity is close. Full-rank training settles into a sharper basin than low-rank methods along random directions, while the reverse holds for the top-1 PCA direction. Each method converges to a geometrically distinct basin. Low-rank activations diverge from full-rank in later layers as training progresses, with GaLore tracking full-rank most closely. Further, validation perplexity does not translate to downstream performance at every scale. Adding geometric and spectral metrics improves the prediction.
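As a concrete example of the activation-similarity dimension mentioned in the abstract: a common way to quantify how closely a low-rank model's layer activations track full-rank training is linear CKA (Kornblith et al., 2019). The sketch below is a minimal implementation of that standard metric, not necessarily the exact similarity measure used in the paper; `x` and `y` are assumed to hold activations from the same layer of two models on the same batch.

```python
import torch

def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    """Linear CKA between activation matrices of shape (n_samples, dim).

    Returns a value in [0, 1]; 1 means the representations match up to
    an orthogonal transform and isotropic scaling.
    """
    x = x - x.mean(dim=0, keepdim=True)  # center each feature
    y = y - y.mean(dim=0, keepdim=True)
    cross = torch.linalg.norm(y.T @ x) ** 2                    # ||Y^T X||_F^2
    denom = torch.linalg.norm(x.T @ x) * torch.linalg.norm(y.T @ y)
    return (cross / denom).item()

# Hypothetical usage: compare one layer's activations across two models
# acts_lowrank, acts_fullrank: (batch_size * seq_len, hidden_dim) tensors
# print(linear_cka(acts_lowrank, acts_fullrank))
```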