ArXiv TLDR

Beyond N-gram: Data-Aware X-GRAM Extraction for Efficient Embedding Parameter Scaling

arXiv: 2604.21724

Yilong Chen, Yanxi Xie, Zitian Gao, He Xin, Yihao Xiao + 8 more

cs.CL

TLDR

X-GRAM is a framework that makes embedding parameter scaling in large language models more efficient by curbing the memory growth and redundancy of large token-indexed lookup tables.

Key contributions

  • X-GRAM compresses long-tail embeddings with hybrid hashing and alias mixing while preserving head capacity (see the sketch after this list).
  • Refines retrieved vectors with a normalized SwiGLU ShortConv to extract diverse local n-gram features.
  • Integrates these features into attention value streams and inter-layer residuals via depth-aware gating, aligning static memory with dynamic context.
  • Improves average accuracy by up to 4.4 points over the vanilla backbone while using 50% smaller embedding tables.
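The hybrid lookup can be pictured as a small dedicated table for frequent tokens plus a shared hashed table for the long tail. Below is a minimal PyTorch sketch of that idea, assuming a frequency-sorted vocabulary and a learned alias-mixing weight; `HybridHashEmbedding` and all of its parameters are illustrative names, not the paper's released code.

```python
import torch
import torch.nn as nn


class HybridHashEmbedding(nn.Module):
    """Head tokens get dedicated rows; tail tokens share a smaller hashed table."""

    def __init__(self, head_size: int, tail_slots: int, dim: int):
        super().__init__()
        self.head_size = head_size                   # token ids < head_size are treated as "head"
        self.head = nn.Embedding(head_size, dim)     # full-capacity rows for frequent tokens
        self.tail = nn.Embedding(tail_slots, dim)    # compressed shared rows for the long tail
        self.tail_slots = tail_slots
        self.mix = nn.Parameter(torch.tensor(0.5))   # learned alias-mixing weight (assumption)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        is_head = token_ids < self.head_size
        head_vec = self.head(token_ids.clamp(max=self.head_size - 1))

        # Two independent hashes into the shared tail table, blended ("alias mixing")
        # so a rare token is not tied to a single shared row.
        tail_ids = (token_ids - self.head_size).clamp(min=0)
        h1 = tail_ids % self.tail_slots
        h2 = (tail_ids * 2654435761) % self.tail_slots
        tail_vec = self.mix * self.tail(h1) + (1 - self.mix) * self.tail(h2)

        return torch.where(is_head.unsqueeze(-1), head_vec, tail_vec)


emb = HybridHashEmbedding(head_size=8_000, tail_slots=4_000, dim=256)
print(emb(torch.tensor([[3, 12_345, 49_999]])).shape)  # torch.Size([1, 3, 256])
```

Blending two hash lookups is one plausible reading of how alias mixing mitigates slot collisions: collisions under one hash are unlikely to repeat under the other, so colliding tail tokens still receive distinct mixtures.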

Why it matters

This paper introduces a memory-centric scaling approach for large language models. It addresses critical issues of embedding inefficiency and memory growth, offering a practical path to decouple model capacity from compute costs and paving the way for more scalable and efficient memory-augmented architectures.

Original Abstract

Large token-indexed lookup tables provide a compute-decoupled scaling path, but their practical gains are often limited by poor parameter efficiency and rapid memory growth. We attribute these limitations to Zipfian under-training of the long tail, heterogeneous demand across layers, and "slot collapse" that produces redundant embeddings. To address this, we propose X-GRAM, a frequency-aware dynamic token-injection framework. X-GRAM employs hybrid hashing and alias mixing to compress the tail while preserving head capacity, and refines retrieved vectors via normalized SwiGLU ShortConv to extract diverse local n-gram features. These signals are integrated into attention value streams and inter-layer residuals using depth-aware gating, effectively aligning static memory with dynamic context. This design introduces a memory-centric scaling axis that decouples model capacity from FLOPs. Extensive evaluations at the 0.73B and 1.15B scales show that X-GRAM improves average accuracy by as much as 4.4 points over the vanilla backbone and 3.2 points over strong retrieval baselines, while using substantially smaller tables in the 50% configuration. Overall, by decoupling capacity from compute through efficient memory management, X-GRAM offers a scalable and practical paradigm for future memory-augmented architectures. Code available at https://github.com/Longyichen/X-gram.
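As a rough illustration of the refinement and injection steps the abstract describes, here is a hedged PyTorch sketch: a causal depthwise short convolution with SwiGLU-style gating over the retrieved vectors, followed by a per-depth gate that adds the result to a layer's residual stream. All module names, shapes, and the normalization and initialization choices are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLUShortConv(nn.Module):
    """Refines retrieved n-gram vectors with a normalized, gated short convolution."""

    def __init__(self, dim: int, kernel_size: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)                # normalization variant is an assumption
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim, padding=kernel_size - 1)
        self.gate = nn.Linear(dim, dim)
        self.up = nn.Linear(dim, dim)
        self.down = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, dim)
        h = self.norm(x)
        # Causal depthwise "short conv": trim the right-side padding so position t
        # only mixes features from positions <= t (a few neighboring tokens).
        c = self.conv(h.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return self.down(F.silu(self.gate(c)) * self.up(c))   # SwiGLU-style gating


class DepthGatedInjection(nn.Module):
    """Adds refined memory features into a layer's residual stream via a per-depth gate."""

    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        # Gates start strongly negative so the injection is nearly closed at init (assumption).
        self.gates = nn.Parameter(torch.full((num_layers, dim), -4.0))

    def forward(self, residual: torch.Tensor, memory: torch.Tensor, layer: int) -> torch.Tensor:
        return residual + torch.sigmoid(self.gates[layer]) * memory


refine = SwiGLUShortConv(dim=256)
inject = DepthGatedInjection(dim=256, num_layers=12)
memory = refine(torch.randn(2, 16, 256))                        # refined n-gram features
print(inject(torch.randn(2, 16, 256), memory, layer=3).shape)   # torch.Size([2, 16, 256])
```

The per-depth gate reflects the abstract's point about heterogeneous demand across layers: each layer learns independently how much of the static memory signal to blend into its dynamic context.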
