ArXiv TLDR

SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection

arXiv: 2605.02888

Shikhar Shukla

cs.LG cs.AI cs.CL cs.DC eess.SY

TLDR

SpecKV adaptively selects the speculation length (γ) for speculative LLM decoding, improving tokens per speculation step by 56% over the fixed-γ=4 baseline with only 0.34 ms of per-decision overhead.

Key contributions

  • Introduces SpecKV, an adaptive controller for speculative decoding's γ (speculation length).
  • Profiles speculative decoding across 4 task categories, 4 speculation lengths, and 3 compression levels (FP16, INT8, NF4), collecting 5,112 step-level records.
  • Uses draft-model confidence and entropy to predict the optimal γ and maximize expected tokens per step (see the signal sketch after this list).
  • Achieves a 56.0% improvement over the fixed-γ=4 baseline with only 0.34 ms overhead per decision (<0.5% of step time).
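
As a rough illustration of the signals involved: confidence and entropy are both cheap functions of the draft model's next-token distribution, so a controller can read them per proposed token at essentially no extra cost. Below is a minimal sketch, not the paper's code; the function name and the use of NumPy are my own assumptions.

```python
# Hypothetical sketch of the per-token draft signals a SpecKV-style
# controller could use; not the paper's implementation.
import numpy as np

def draft_signals(logits: np.ndarray) -> tuple[float, float]:
    """Return (confidence, entropy) for one draft-token distribution.

    `logits` is the draft model's unnormalized output at one position,
    shape (vocab_size,). Both signals fall out of a single softmax.
    """
    z = logits - logits.max()                 # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    confidence = float(p.max())               # high -> token likely accepted
    entropy = float(-(p * np.log(p + 1e-12)).sum())  # high -> uncertain draft
    return confidence, entropy
```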

Why it matters

This paper addresses a key limitation of speculative decoding by making the speculation length adaptive, which matters because the best γ varies across task types and with the compression applied to the target model. The sizable gains at near-zero overhead make SpecKV a practical improvement to LLM inference efficiency.

Original Abstract

Speculative decoding accelerates large language model (LLM) inference by using a small draft model to propose candidate tokens that a larger target model verifies. A critical hyperparameter in this process is the speculation length γ, which determines how many tokens the draft model proposes per step. Nearly all existing systems use a fixed γ (typically 4), yet empirical evidence suggests that the optimal value varies across task types and, crucially, depends on the compression level applied to the target model. In this paper, we present SpecKV, a lightweight adaptive controller that selects γ per speculation step using signals extracted from the draft model itself. We profile speculative decoding across 4 task categories, 4 speculation lengths, and 3 compression levels (FP16, INT8, NF4), collecting 5,112 step-level records with per-step acceptance rates, draft entropy, and draft confidence. We demonstrate that the optimal γ shifts across compression regimes and that draft model confidence and entropy are strong predictors of acceptance rate (correlation ≈ 0.56). SpecKV uses a small MLP trained on these signals to maximize expected tokens per speculation step, achieving a 56.0% improvement over the fixed-γ=4 baseline with only 0.34 ms overhead per decision (<0.5% of step time). The improvement is statistically significant (p < 0.001, paired bootstrap test). We release all profiling data, trained models, and notebooks as open-source artifacts.
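
For intuition on why the optimal γ moves: under the standard i.i.d.-acceptance model of speculative decoding (the textbook analysis, not the paper's fitted model), a step with speculation length γ and per-token acceptance rate α emits (1 − α^(γ+1)) / (1 − α) tokens in expectation, while costing γ draft forward passes plus one target verification pass. The toy calculation below, with an assumed 10:1 target-to-draft cost ratio, shows the throughput-optimal γ shrinking as α falls, which is the direction of effect one would expect when heavier compression (INT8, NF4) degrades the draft's acceptance rate.

```python
# Toy model (assumed costs, standard i.i.d.-acceptance analysis, not the
# paper's data): how the throughput-optimal speculation length gamma
# shifts with the per-token acceptance rate alpha.

def expected_tokens(alpha: float, gamma: int) -> float:
    # E[tokens emitted per step] = (1 - alpha**(gamma + 1)) / (1 - alpha)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

def throughput(alpha: float, gamma: int,
               draft_cost: float = 0.1, target_cost: float = 1.0) -> float:
    # One step costs gamma draft forward passes plus one target pass.
    # draft_cost = 0.1 (relative to the target) is an assumed ratio.
    return expected_tokens(alpha, gamma) / (gamma * draft_cost + target_cost)

for alpha in (0.9, 0.7, 0.5):
    best = max(range(1, 17), key=lambda g: throughput(alpha, g))
    print(f"alpha={alpha}: throughput-optimal gamma = {best}")
# Prints gamma = 10, 4, 2: high acceptance rewards long speculation,
# low acceptance favors short speculation.
```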
