ArXiv TLDR

Compressing Sequences in the Latent Embedding Space: $K$-Token Merging for Large Language Models

2604.15153

Zihao Xu, John Harvill, Ziwei Fan, Yizhou Sun, Hao Ding + 1 more

cs.CL cs.AI

TLDR

K-Token Merging compresses LLM inputs in the latent embedding space, reducing sequence length by up to 75% with minimal performance loss.

Key contributions

  • Addresses high computational costs of LLMs on long prompts due to quadratic self-attention.
  • Introduces K-Token Merging, a novel latent-space compression framework for LLM inputs.
  • Merges contiguous blocks of K token embeddings into a single embedding using a lightweight encoder.
  • Achieves up to 75% input length reduction with minimal performance degradation across tasks.
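
The merging step described above can be sketched as a reshape-and-project operation: group each contiguous block of K embeddings, flatten the block, and project it back to the model dimension. This is a minimal NumPy illustration; the concatenate-then-project design and the single projection matrix are assumptions, not the authors' exact lightweight-encoder architecture.

```python
import numpy as np

def k_token_merge(embeddings: np.ndarray, k: int, w: np.ndarray) -> np.ndarray:
    """Merge each contiguous block of k token embeddings into one embedding.

    embeddings: (seq_len, d_model) input token embeddings.
    w:          (k * d_model, d_model) projection standing in for the
                paper's lightweight encoder (an assumption for this sketch).
    Pads the sequence with zero embeddings so its length is divisible by k.
    """
    t, d = embeddings.shape
    pad = (-t) % k
    if pad:
        embeddings = np.vstack([embeddings, np.zeros((pad, d))])
    blocks = embeddings.reshape(-1, k * d)  # one row per block of k tokens
    return blocks @ w                       # (ceil(t / k), d_model)

rng = np.random.default_rng(0)
d_model, k, seq_len = 64, 4, 100
emb = rng.standard_normal((seq_len, d_model))
w = rng.standard_normal((k * d_model, d_model)) / np.sqrt(k * d_model)
compressed = k_token_merge(emb, k, w)
print(compressed.shape)  # (25, 64)
```

With K=4, the 100-token sequence becomes 25 merged embeddings, matching the paper's 75% input length reduction.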

Why it matters

Long prompts are a major bottleneck for LLMs because full self-attention scales quadratically with input length. K-Token Merging compresses prompts directly in the latent embedding space, cutting sequence length by up to 75% with minimal performance loss, which makes LLMs more practical and cost-effective for processing long textual inputs.
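
Because self-attention cost grows with the square of sequence length, a K-fold length reduction cuts attention cost by roughly K squared. A quick back-of-the-envelope check (the K=4 setting here is an illustrative assumption consistent with the reported 75% reduction):

```python
def attention_cost_ratio(k: int) -> float:
    """Relative self-attention cost after merging every k tokens into one.

    Full self-attention scales as O(n^2) in sequence length n, so
    compressing n tokens to n / k embeddings cuts attention cost by k^2.
    """
    return 1.0 / (k * k)

k = 4
print(f"length reduction: {1 - 1 / k:.0%}")                            # 75%
print(f"attention cost:   {attention_cost_ratio(k):.2%} of original")  # 6.25% of original
```

So the 75% length reduction corresponds to roughly a 16x drop in self-attention compute, before accounting for the small overhead of the merging encoder and the LoRA adaptation.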

Original Abstract

Large Language Models (LLMs) incur significant computational and memory costs when processing long prompts, as full self-attention scales quadratically with input length. Token compression aims to address this challenge by reducing the number of tokens representing inputs. However, existing prompt-compression approaches primarily operate in token space and overlook inefficiencies in the latent embedding space. In this paper, we propose K-Token Merging, a latent-space compression framework that merges each contiguous block of K token embeddings into a single embedding via a lightweight encoder. The compressed sequence is processed by a LoRA-adapted LLM, while generation remains in the original vocabulary. Experiments on structural reasoning (Textualized Tree), sentiment classification (Amazon Reviews), and code editing (CommitPackFT) show that K-Token Merging lies on the Pareto frontier of performance vs. compression, achieving up to 75% input length reduction with minimal performance degradation.
