
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

arXiv: 2604.24715

Parsa Ashrafi Fashi, Utkarsh Saxena, Mehdi Rezagholizadeh, Aref Jafari, Akash Haridas + 5 more

cs.CL, cs.LG

TLDR

HyLo upcycles pretrained Transformers into hybrid LLMs, extending usable context length by up to 32x and cutting KV-cache memory by more than 90% for efficient long-context processing.

Key contributions

  • Upcycles pretrained Transformers into efficient hybrid architectures (HyLo).
  • Combines Multi-Head Latent Attention (MLA) with linear sequence blocks (Mamba2 or Gated DeltaNet), plus staged long-context training and teacher-guided distillation (see the layer-conversion sketch after this list).
  • Extends usable context length by up to 32x and reduces KV-cache memory by >90%.
  • Outperforms state-of-the-art upcycled baselines on long-context evaluations.
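
The snippet below is a minimal, illustrative sketch of the layer-conversion idea: keep (latent-)attention in a few positions and mark the remaining layers for conversion to linear sequence blocks. The ratio, ordering, and names are assumptions made for illustration, not the conversion recipe reported in the paper.

```python
# Illustrative sketch only: the attention/linear ratio, ordering, and names
# below are assumptions, not HyLo's actual conversion recipe.
from dataclasses import dataclass


@dataclass
class LayerPlan:
    index: int
    kind: str  # "mla" (latent attention) or "linear" (Mamba2 / Gated DeltaNet style)


def plan_hybrid_stack(num_layers: int, attention_every: int = 4) -> list[LayerPlan]:
    """Keep one latent-attention layer in every `attention_every` layers and
    mark the rest for conversion to linear sequence-modeling blocks."""
    return [
        LayerPlan(i, "mla" if (i + 1) % attention_every == 0 else "linear")
        for i in range(num_layers)
    ]


if __name__ == "__main__":
    for layer in plan_hybrid_stack(num_layers=16):
        print(f"layer {layer.index:2d}: {layer.kind}")
```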

Why it matters

Pure Transformers struggle with long contexts and memory. This paper offers a practical method to convert existing LLMs into efficient hybrid models. It significantly improves long-context capabilities and memory usage, making large language models more accessible and performant for complex tasks.

Original Abstract

Hybrid sequence models that combine efficient Transformer components with linear sequence modeling blocks are a promising alternative to pure Transformers, but most are still pretrained from scratch and therefore fail to reuse existing Transformer checkpoints. We study upcycling as a practical path to convert pretrained Transformer LLMs into hybrid architectures while preserving short-context quality and improving long-context capability. We call our solution HyLo (HYbrid LOng-context): a long-context upcycling recipe that combines architectural adaptation with efficient Transformer blocks, Multi-Head Latent Attention (MLA), and linear blocks (Mamba2 or Gated DeltaNet), together with staged long-context training and teacher-guided distillation for stable optimization. HyLo extends usable context length by up to 32× through efficient post-training and reduces KV-cache memory by more than 90%, enabling up to 2M-token prefill and decoding in our vLLM inference stack, while comparable Llama baselines run out of memory beyond 64K context. Across 1B- and 3B-scale settings (Llama- and Qwen-based variants), HyLo delivers consistently strong short- and long-context performance and significantly outperforms state-of-the-art upcycled hybrid baselines on long-context evaluations such as RULER. Notably, at similar scale, HyLo-Qwen-1.7B trained on only 10B tokens significantly outperforms JetNemotron (trained on 400B tokens) on GSM8K, Lm-Harness common sense reasoning and RULER-64K.
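
As a rough illustration of where the memory savings come from, the sketch below compares standard per-token K/V caching against a hybrid in which a few MLA layers cache one compressed latent per token and the linear layers keep a fixed-size recurrent state. Every number here (layer split, heads, latent size, state size) is an assumed, roughly 1B-scale configuration, not a figure reported in the paper.

```python
# Back-of-the-envelope cache estimate. All dimensions are illustrative
# assumptions (roughly 1B-scale), not the paper's measured configuration.

def mha_kv_bytes(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                 bytes_per_elem: int = 2) -> int:
    # Standard attention caches K and V for every token in every layer.
    return layers * seq_len * 2 * kv_heads * head_dim * bytes_per_elem


def hybrid_cache_bytes(attn_layers: int, latent_dim: int, linear_layers: int,
                       state_bytes: int, seq_len: int,
                       bytes_per_elem: int = 2) -> int:
    # MLA layers cache one compressed latent per token; linear layers keep a
    # constant-size recurrent state that does not grow with sequence length.
    mla = attn_layers * seq_len * latent_dim * bytes_per_elem
    linear = linear_layers * state_bytes
    return mla + linear


if __name__ == "__main__":
    seq = 128_000
    dense = mha_kv_bytes(layers=16, kv_heads=8, head_dim=64, seq_len=seq)
    hybrid = hybrid_cache_bytes(attn_layers=4, latent_dim=256, linear_layers=12,
                                state_bytes=8 * 2**20, seq_len=seq)
    print(f"dense KV cache : {dense / 2**30:.2f} GiB")
    print(f"hybrid cache   : {hybrid / 2**30:.2f} GiB")
    print(f"reduction      : {100 * (1 - hybrid / dense):.1f}%")
```

Under these assumed settings the estimate lands near a 90% reduction at 128K tokens; the actual savings depend on the model's head count, latent dimension, and how many layers remain attention-based.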
