From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning
Kiran Purohit, Ramasuri Narayanam, Soumyabrata Pal
TLDR
SpecGuard improves speculative decoding for LLMs by using internal signals for step-level verification, boosting accuracy and reducing latency in multi-step reasoning.
Key contributions
- Introduces SpecGuard, a framework for step-level verification in speculative decoding using only internal model signals.
- Mitigates error propagation in multi-step reasoning by validating draft candidates with two internal scores.
- Utilizes attention-based grounding and log-probability confidence for robust step acceptance decisions.
- Achieves 3.6% higher accuracy and ~11% lower latency than both standard and reward-guided speculative decoding.
Why it matters
Current speculative decoding struggles with error propagation in multi-step reasoning. SpecGuard offers an efficient, novel solution by using only internal model signals for step-level verification, avoiding external overhead. This significantly improves LLM accuracy and speed for complex reasoning tasks, making inference more reliable.
Original Abstract
Speculative decoding (SD) accelerates large language model inference by allowing a lightweight draft model to propose outputs that a stronger target model verifies. However, its token-centric nature allows erroneous steps to propagate. Prior approaches mitigate this using external reward models, but these incur additional latency and computational overhead and limit generalizability. We propose SpecGuard, a verification-aware speculative decoding framework that performs step-level verification using only model-internal signals. At each step, SpecGuard samples multiple draft candidates and selects the most consistent step, which is then validated using an ensemble of two lightweight model-internal signals: (i) an attention-based grounding score that measures attribution to the input and previously accepted steps, and (ii) a log-probability-based score that captures token-level confidence. These signals jointly determine whether a step is accepted or recomputed by the target model, allocating compute selectively. Experiments across a range of reasoning benchmarks show that SpecGuard improves accuracy by 3.6% while reducing latency by ~11%, outperforming both SD and reward-guided SD.
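The step-level accept/recompute decision described in the abstract can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: `specguard_step`, the frequency-based consistency pick, the equal-weight score ensemble, and the threshold `tau` are all assumptions standing in for the real attention-attribution and log-probability machinery.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class StepDecision:
    step: str
    accepted: bool  # True -> draft step kept; False -> recomputed by the target model

def specguard_step(
    candidates: List[str],
    grounding_score: Callable[[str], float],   # stand-in for attention attribution to the
                                               # input and previously accepted steps, in [0, 1]
    confidence_score: Callable[[str], float],  # stand-in for a normalized token log-probability score
    target_recompute: Callable[[], str],       # fallback: the stronger target model regenerates the step
    tau: float = 0.5,                          # hypothetical acceptance threshold
) -> StepDecision:
    # 1) Select the most consistent draft candidate; here, simple majority
    #    vote over the sampled strings serves as a self-consistency proxy.
    best = max(set(candidates), key=candidates.count)
    # 2) Ensemble the two internal signals (equal weighting is an assumption).
    score = 0.5 * grounding_score(best) + 0.5 * confidence_score(best)
    # 3) Accept the draft step, or spend target-model compute to redo it.
    if score >= tau:
        return StepDecision(best, True)
    return StepDecision(target_recompute(), False)

# Toy demo with stub scorers standing in for real model internals.
cands = ["x = 12", "x = 12", "x = 13"]
d = specguard_step(
    cands,
    grounding_score=lambda s: 0.9,
    confidence_score=lambda s: 0.8,
    target_recompute=lambda: "x = 12 (target)",
)
print(d.accepted, d.step)  # → True x = 12
```

The key property this sketch preserves is selective compute allocation: the expensive target model is only invoked when the ensemble of internal signals flags a draft step as unreliable.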