ArXiv TLDR

From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning

arXiv: 2604.15244

Kiran Purohit, Ramasuri Narayanam, Soumyabrata Pal

cs.CL

TLDR

SpecGuard improves speculative decoding for LLMs by using internal signals for step-level verification, boosting accuracy and reducing latency in multi-step reasoning.

Key contributions

  • Introduces SpecGuard, a framework for step-level verification in speculative decoding using only internal model signals.
  • Mitigates error propagation in multi-step reasoning by validating draft candidates with two internal scores.
  • Utilizes attention-based grounding and log-probability confidence for robust step acceptance decisions.
  • Achieves 3.6% higher accuracy and ~11% lower latency compared to standard and reward-guided SD.
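The two internal signals can be sketched as a simple ensemble check. This is a minimal illustration, not the paper's implementation: the exact scoring formulas, thresholds (`g_thresh`, `c_thresh`), and the way attention mass is aggregated are assumptions for the sake of the example.

```python
import math

def grounding_score(attn_weights, context_len):
    """Hypothetical attention-based grounding score: the fraction of
    attention mass the draft step places on the input prompt and
    previously accepted steps (the first `context_len` positions)."""
    total = sum(attn_weights)
    grounded = sum(attn_weights[:context_len])
    return grounded / total if total > 0 else 0.0

def confidence_score(token_logprobs):
    """Log-probability confidence: mean token log-prob, mapped to
    (0, 1] via exp so it can be thresholded like a probability."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def accept_step(attn_weights, context_len, token_logprobs,
                g_thresh=0.5, c_thresh=0.4):
    """Accept the draft step only if both internal signals clear their
    (illustrative) thresholds; otherwise the step would be recomputed
    by the stronger target model."""
    g = grounding_score(attn_weights, context_len)
    c = confidence_score(token_logprobs)
    return g >= g_thresh and c >= c_thresh
```

A step that attends mostly to the prompt and is decoded confidently passes; a step whose attention drifts away from the accepted context fails and triggers the target-model fallback.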

Why it matters

Token-centric speculative decoding lets an erroneous reasoning step propagate through the rest of a multi-step chain. SpecGuard addresses this with step-level verification that relies only on internal model signals, avoiding the latency and compute overhead of external reward models. The result is faster, more reliable LLM inference on complex reasoning tasks.

Original Abstract

Speculative decoding (SD) accelerates large language model inference by allowing a lightweight draft model to propose outputs that a stronger target model verifies. However, its token-centric nature allows erroneous steps to propagate. Prior approaches mitigate this using external reward models, but incur additional latency, computational overhead, and limit generalizability. We propose SpecGuard, a verification-aware speculative decoding framework that performs step-level verification using only model-internal signals. At each step, SpecGuard samples multiple draft candidates and selects the most consistent step, which is then validated using an ensemble of two lightweight model-internal signals: (i) an attention-based grounding score that measures attribution to the input and previously accepted steps, and (ii) a log-probability-based score that captures token-level confidence. These signals jointly determine whether a step is accepted or recomputed using the target, allocating compute selectively. Experiments across a range of reasoning benchmarks show that SpecGuard improves accuracy by 3.6% while reducing latency by ~11%, outperforming both SD and reward-guided SD.
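The overall loop described in the abstract (sample candidates, pick the most consistent, verify, fall back to the target) can be sketched as follows. The `draft_step`, `target_step`, and `accept_step` callables, the majority-vote proxy for "most consistent", and the `"<done>"` stop token are all assumptions for illustration, not the paper's actual interfaces.

```python
def specguard_decode(problem, draft_step, target_step, accept_step,
                     n_candidates=4, max_steps=8):
    """Step-level speculative loop (sketch): sample several draft
    candidates per step, pick the most frequent one as a proxy for
    consistency, then either accept it via the internal-signal check
    or recompute it with the target model."""
    steps = []
    for _ in range(max_steps):
        # Draft model proposes multiple candidate next steps.
        candidates = [draft_step(problem, steps) for _ in range(n_candidates)]
        # Majority vote stands in for "most consistent" selection.
        step = max(set(candidates), key=candidates.count)
        if not accept_step(step, problem, steps):
            # Selective compute: only rejected steps hit the target model.
            step = target_step(problem, steps)
        steps.append(step)
        if step == "<done>":
            break
    return steps
```

Because the target model is invoked only for rejected steps, compute is allocated where the draft is unreliable, which is what yields the latency savings alongside the accuracy gain.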

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.