ArXiv TLDR

Position-Aware Drafting for Inference Acceleration in LLM-Based Generative List-Wise Recommendation

arXiv: 2604.27747

Jiaju Chen, Chongming Gao, Chenxiao Fan, Haoyan Liu, Qingpeng Cai + 2 more

cs.IR · cs.AI

TLDR

PAD-Rec accelerates LLM-based generative recommendation by using position-aware speculative drafting, achieving up to 3.1x speedup.

Key contributions

  • Introduces PAD-Rec, a lightweight drafting module for speculative decoding in LLM-based generative recommendation.
  • Incorporates item position embeddings to capture within-item token semantics for better drafting.
  • Uses step position embeddings to adapt to depth-dependent uncertainty, improving proposal quality.
  • Delivers up to 3.1x wall-clock speedup and ~5% average wall-clock speedup gain over strong speculative decoding baselines.
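The two position signals and their gates can be sketched as a small module on top of the draft model's hidden states. This is a minimal illustration, not the paper's implementation: the class and parameter names (`PositionAwareDraftEmbedding`, `tokens_per_item`, `max_draft_steps`) and the exact gating forms are assumptions based on the summary's description of a learnable coefficient for item slots and a context-driven gate for draft steps.

```python
import torch
import torch.nn as nn

class PositionAwareDraftEmbedding(nn.Module):
    """Illustrative sketch of PAD-Rec-style position-aware drafting.

    Augments the draft model's base features with (i) an item position
    embedding for each token's within-item slot and (ii) a step position
    embedding for the speculation depth, each blended in via a gate.
    All names and gating details here are assumptions for illustration.
    """

    def __init__(self, hidden_dim: int, tokens_per_item: int, max_draft_steps: int):
        super().__init__()
        self.item_pos_emb = nn.Embedding(tokens_per_item, hidden_dim)
        self.step_pos_emb = nn.Embedding(max_draft_steps, hidden_dim)
        # Learnable scalar coefficient gating the item-slot signal.
        self.item_gate = nn.Parameter(torch.zeros(1))
        # Context-driven gate for the draft-step signal (from the hidden state).
        self.step_gate = nn.Linear(hidden_dim, 1)

    def forward(self, hidden, item_slot, draft_step):
        # hidden:     (batch, seq, hidden_dim) base draft-model features
        # item_slot:  (batch, seq) within-item slot index of each token
        # draft_step: (batch, seq) speculation depth at which each token is drafted
        slot_signal = self.item_pos_emb(item_slot)
        step_signal = self.step_pos_emb(draft_step)
        step_weight = torch.sigmoid(self.step_gate(hidden))  # (batch, seq, 1)
        return hidden + self.item_gate * slot_signal + step_weight * step_signal
```

Because the module only adds two embedding lookups and two cheap gates per drafted token, it is consistent with the paper's claim of negligible inference overhead.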

Why it matters

This paper addresses the critical latency issue in LLM-based generative recommendation systems. By introducing position-aware drafting, it significantly accelerates inference without sacrificing recommendation quality. This advancement makes real-time, LLM-powered recommendations more practical and efficient for large-scale applications.

Original Abstract

Large language model (LLM)-based generative list-wise recommendation has advanced rapidly, but decoding remains sequential and thus latency-prone. To accelerate inference without changing the target distribution, speculative decoding (SD) uses a small draft model to propose several next tokens at once and a target LLM to verify and accept the longest prefix, skipping multiple steps per round. In generative recommendation, however, each item is represented by multiple semantic-ID tokens, often with separators, and current drafts typically treat these tokens uniformly. This overlooks two practical facts: (i) a token's semantics depend on its within-item slot, and (ii) uncertainty tends to increase with speculation depth. Without modeling these effects, SD's speedups can be limited. We introduce PAD-Rec, Position-Aware Drafting for generative Recommendation, a lightweight module that augments the draft model with two complementary signals. Item position embeddings explicitly encode the within-item slot of each token, strengthening structural awareness. Step position embeddings encode the draft step, allowing the model to adapt to depth-dependent uncertainty and improve proposal quality. To harmonize these signals with base features, we add simple gates: a learnable coefficient for item slots and a context-driven gate for draft steps. The module is trainable, easy to integrate with standard draft models, and adds negligible inference overhead. Extensive experiments on four real-world datasets show up to 3.1x wall-clock speedup and about 5% average wall-clock speedup gain over strong SD baselines, while largely preserving recommendation quality.
