GenRec: A Preference-Oriented Generative Framework for Large-Scale Recommendation
Yanyan Zou, Junbo Qi, Lunsong Huang, Yu Li, Kewei Xu + 5 more
TLDR
GenRec is a generative framework for large-scale recommendation that improves user-preference alignment and serving efficiency, delivering a 9.5% lift in click count and 8.7% in transaction count in month-long online A/B tests.
Key contributions
- Proposes Page-wise NTP for stable, dense gradient signals, resolving output inconsistency in generative retrieval.
- Uses an asymmetric Token Merger to compress multi-token item IDs, cutting input length by ~2X without accuracy loss.
- Introduces GRPO-SR, an RL method with hybrid rewards, to align generative outputs with nuanced user preferences.
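The Page-wise NTP idea above can be sketched as a loss function. This is a minimal illustration, not the paper's implementation: it contrasts point-wise NTP (one positive target per step, ambiguous when several page items share the same prefix) with a hypothetical page-wise variant that averages the log-likelihood over every item on the interaction page.

```python
import numpy as np

def softmax(logits):
    p = np.exp(logits - logits.max())
    return p / p.sum()

def pointwise_ntp_loss(logits, target):
    # Standard NTP: a single positive per step. One-to-many ambiguous
    # when multiple valid page items follow the same input prefix.
    return -np.log(softmax(logits)[target])

def pagewise_ntp_loss(logits, page_items):
    # Page-wise NTP (sketch): supervise against the full set of items
    # on the page, averaging log-likelihood over all of them. Every
    # step contributes gradient for several targets at once, giving
    # the denser signal the paper describes.
    p = softmax(logits)
    return -np.mean(np.log(p[page_items]))
```

With uniform logits over a 10-item vocabulary, both losses reduce to log 10, but the page-wise version spreads its gradient across all listed items rather than a single one.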
Why it matters
GenRec addresses critical scaling and preference alignment challenges in large-scale generative recommendation systems. Its novel techniques, including Page-wise NTP and GRPO-SR, enable efficient and accurate preference modeling. Deployed on JD App, it demonstrates substantial improvements in user engagement and transactions.
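The asymmetric Token Merger named in the contributions can be pictured as a grouped linear projection over the prompt's semantic-ID token embeddings. The sketch below is an assumption-laden illustration (the merge width, random projection, and function name are all hypothetical): it folds each group of `merge` embeddings into one, shrinking prefill length by roughly that factor, while decoding would remain untouched and emit full-resolution IDs.

```python
import numpy as np

def token_merger(token_embs, merge=2, rng=None):
    """Hypothetical asymmetric Token Merger sketch.

    Compresses only the *prompt* side: each group of `merge`
    semantic-ID token embeddings is concatenated and linearly
    projected back to the model dimension, so a sequence of length
    seq_len enters the decoder as seq_len // merge vectors (~2X
    shorter for merge=2). The projection W is randomly initialized
    here; in a real system it would be learned.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    seq_len, dim = token_embs.shape
    assert seq_len % merge == 0, "pad the prompt to a multiple of merge"
    W = rng.standard_normal((merge * dim, dim)) / np.sqrt(merge * dim)
    grouped = token_embs.reshape(seq_len // merge, merge * dim)
    return grouped @ W  # shape: (seq_len // merge, dim)
```

The asymmetry is the point of the design: compression applies only during prefilling, where long behavior histories dominate cost, so decoding accuracy is preserved.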
Original Abstract
Generative Retrieval (GR) offers a promising paradigm for recommendation through next-token prediction (NTP). However, scaling it to large-scale industrial systems introduces three challenges: (i) within a single request, identical model inputs may produce inconsistent outputs due to the pagination request mechanism; (ii) the prohibitive cost of encoding long user behavior sequences with multi-token item representations based on semantic IDs; and (iii) aligning the generative policy with nuanced user preference signals. We present GenRec, a preference-oriented generative framework deployed on the JD App that addresses the above challenges within a single decoder-only architecture. For the training objective, we propose the Page-wise NTP task, which supervises over an entire interaction page rather than each interacted item individually, providing a denser gradient signal and resolving the one-to-many ambiguity of point-wise training. On the prefilling side, an asymmetric linear Token Merger compresses multi-token Semantic IDs in the prompt while preserving full-resolution decoding, reducing input length by ~2X with negligible accuracy loss. To further align outputs with user satisfaction, we introduce GRPO-SR, a reinforcement learning method that pairs Group Relative Policy Optimization with NLL regularization for training stability and employs Hybrid Rewards combining a dense reward model with a relevance gate to mitigate reward hacking. In month-long online A/B tests serving production traffic, GenRec achieves a 9.5% improvement in click count and an 8.7% improvement in transaction count over the existing pipeline.