ArXiv TLDR

Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima

arXiv: 2604.09258

Huanran Chen, Huaqing Zhang, Xiao Li, Yinpeng Dong, Ke Shen, et al.

cs.LG

TLDR

The Nexus optimizer improves LLM generalization by encouraging common minima across pretraining data sources, achieving better downstream performance at the same pretraining loss.

Key contributions

  • Investigates whether LLMs converge to common minima across data sources during pretraining.
  • Reveals that standard optimizers such as AdamW often converge to points where task-specific minima are distant from one another.
  • Proposes the Nexus optimizer, which encourages common minima by maximizing gradient similarity during optimization (see the sketch after this list).
  • Shows that Nexus significantly boosts downstream performance while reaching the same pretraining loss.
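The summary does not include pseudocode, but the core idea, minimizing the summed loss while maximizing gradient similarity across data sources, can be sketched. The PyTorch step below is a hypothetical illustration of that objective, not the paper's actual algorithm: `loss_fn`, `batches_by_source`, and the weight `lam` are assumptions.

```python
import torch
import torch.nn.functional as F

def nexus_style_step(model, loss_fn, batches_by_source, optimizer, lam=0.1):
    """One hypothetical step: minimize the summed per-source loss while
    rewarding pairwise gradient agreement across data sources."""
    assert len(batches_by_source) >= 2, "need at least two data sources"
    params = [p for p in model.parameters() if p.requires_grad]
    flat_grads, total_loss = [], 0.0
    for batch in batches_by_source:
        loss = loss_fn(model, batch)  # assumed per-source loss helper
        # create_graph=True lets the similarity term be differentiated later
        grads = torch.autograd.grad(loss, params, create_graph=True)
        flat_grads.append(torch.cat([g.flatten() for g in grads]))
        total_loss = total_loss + loss
    # Pairwise cosine similarity between flattened per-source gradients.
    sim = sum(F.cosine_similarity(flat_grads[i], flat_grads[j], dim=0)
              for i in range(len(flat_grads))
              for j in range(i + 1, len(flat_grads)))
    # Subtracting the similarity term means alignment is maximized.
    objective = total_loss - lam * sim
    optimizer.zero_grad()
    objective.backward()
    optimizer.step()
    return total_loss.detach(), sim.detach()
```

The second-order backward pass this naive form requires would be far too expensive at LLM scale; the actual Nexus optimizer presumably achieves the same effect with a cheaper mechanism.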

Why it matters

This work challenges the reliance on pretraining loss as the sole proxy for LLM evaluation. It demonstrates that implicit biases in optimization can unlock significant downstream generalization, offering a new path to enhance LLM capabilities.

Original Abstract

Pretraining is the cornerstone of Large Language Models (LLMs), dominating the vast majority of the computational budget and data and serving as the primary engine of their capabilities. During pretraining, LLMs acquire foundational knowledge from unprecedentedly massive and diverse data sources, spanning domains such as general language, mathematics, code, and complex reasoning. In this work, we investigate an interesting geometric question regarding the converged state of pretraining: does the model converge to a common minimizer across all data sources, or merely to a minimizer of the summed loss? We hypothesize that the geometric "closeness" of task-specific minima is intrinsically linked to downstream generalization. We reveal that standard optimizers (e.g., AdamW) often converge to points where task-specific minima are distant from each other. To address this, we propose the Nexus optimizer, which encourages the closeness of these minima by maximizing gradient similarity during optimization. Experiments across models ranging from 130M to 3B parameters, with various data mixtures and hyperparameter schedules, show that Nexus significantly boosts downstream performance despite achieving the same pretraining loss. Notably, on the 3B model, Nexus reduces the out-of-distribution loss by 0.012 and yields up to a 15.0% accuracy improvement on complex reasoning tasks (e.g., GSM8k). This finding challenges the reliance on pretraining loss as the sole proxy for model evaluation and demonstrates the importance of implicit biases in unlocking downstream generalization.
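The abstract's central question, whether per-source minima are close or distant at convergence, also suggests a simple empirical probe: if the sources share a nearby minimum, their gradients at a converged checkpoint should roughly agree. The helper below is a hypothetical diagnostic along those lines, not a measurement from the paper; `loss_fn` and `batches_by_source` are assumed stand-ins.

```python
import itertools
import torch
import torch.nn.functional as F

def per_source_gradient_similarity(model, loss_fn, batches_by_source):
    """Pairwise cosine similarity of per-source gradients at the current
    checkpoint; persistently low values would indicate that the sources
    are pulling the model toward distant minima."""
    flat_grads = []
    for batch in batches_by_source:
        model.zero_grad(set_to_none=True)
        loss_fn(model, batch).backward()
        flat_grads.append(torch.cat(
            [p.grad.flatten() for p in model.parameters()
             if p.grad is not None]))
    return {(i, j): F.cosine_similarity(gi, gj, dim=0).item()
            for (i, gi), (j, gj) in
            itertools.combinations(enumerate(flat_grads), 2)}
```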
