ArXiv TLDR

Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation

arXiv: 2604.02324

Daiwei Chen, Zhoutong Fu, Chengming Jiang, Haichao Zhang, Ran Zhou + 10 more

cs.CL, cs.AI, cs.LG

TLDR

This paper introduces Grounded Token Initialization (GTI), a lightweight method that grounds newly added vocabulary tokens in a pretrained LM's embedding space before fine-tuning, outperforming the standard mean initialization on generative recommendation benchmarks.

Key contributions

  • Identifies mean initialization as a key bottleneck: it collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that fine-tuning struggles to fully recover.
  • Proposes the Grounded Token Initialization Hypothesis: linguistically grounding new tokens in the pretrained embedding space before fine-tuning lets the model better apply its general-purpose knowledge to novel-token domains.
  • Introduces GTI, a lightweight grounding stage that maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision (see the sketch after this list).
  • Shows that GTI outperforms mean initialization and auxiliary-task adaptation methods in most evaluation settings across industry-scale and public generative recommendation benchmarks.
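
To make the contrast concrete, below is a minimal, illustrative sketch of the two initialization strategies. It is not the paper's exact GTI procedure: it assumes a HuggingFace-style tokenizer, an existing embedding matrix, and per-item text descriptions, and the helper names are hypothetical.

```python
import torch

def mean_init(embeddings: torch.Tensor, num_new: int) -> torch.Tensor:
    """Standard practice: every new token starts at the mean of the existing
    vocabulary, so all new rows are identical (the degenerate subspace the
    paper diagnoses)."""
    return embeddings.mean(dim=0, keepdim=True).repeat(num_new, 1)

def grounded_init(embeddings: torch.Tensor, tokenizer,
                  descriptions: list[str]) -> torch.Tensor:
    """Illustrative grounding: place each new token at a distinct,
    text-dependent location derived from its paired linguistic supervision
    (e.g., an item title or description)."""
    rows = []
    for text in descriptions:
        ids = tokenizer(text, add_special_tokens=False)["input_ids"]
        # Pool the pretrained embeddings of the description's tokens.
        rows.append(embeddings[ids].mean(dim=0))
    return torch.stack(rows)
```

Under mean initialization the new rows are indistinguishable before fine-tuning; the grounded variant gives each new token a distinct starting point already tied to pretrained semantics.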

Why it matters

This paper addresses a critical bottleneck in extending LMs with new vocabulary for domain-specific tasks such as generative recommendation. It offers a systematic diagnosis of why the standard mean initialization fails and proposes GTI, a simple yet effective grounding stage applied before fine-tuning, improving LM adaptation and performance on novel-token domains.

Original Abstract

Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that token initialization is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the Grounded Token Initialization Hypothesis: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
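
As a rough illustration of the kind of spectral diagnostic the abstract alludes to, one can check whether a block of new-token embeddings has collapsed by inspecting its singular-value spectrum. The entropy-based effective rank below is a standard measure and only an assumption about what the paper computes, not its exact diagnostic.

```python
import torch

def effective_rank(new_token_embeddings: torch.Tensor) -> float:
    """Entropy-based effective rank of the new-token embedding block (n x d).
    A value near 1 signals collapse into a (near) one-dimensional subspace;
    larger values indicate richer inter-token structure."""
    s = torch.linalg.svdvals(new_token_embeddings)
    p = s / s.sum()
    entropy = -(p * torch.log(p + 1e-12)).sum()
    return torch.exp(entropy).item()

# Identical rows (mean-style init) vs. distinct rows (grounded-style init).
mean_like = torch.randn(1, 768).repeat(512, 1)   # effective rank ~ 1
grounded_like = torch.randn(512, 768)            # much larger effective rank
print(effective_rank(mean_like), effective_rank(grounded_like))
```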
