Sample Is Feature: Beyond Item-Level, Toward Sample-Level Tokens for Unified Large Recommender Models
Shuli Wang, Junwei Yin, Changhao Li, Senjie Kou, Chi Wang + 4 more
TLDR
SIF encodes full historical samples into tokens for unified large recommender models, improving performance and resolving feature heterogeneity.
Key contributions
- Introduces SIF, encoding full historical raw samples into sequence tokens for unified recommender models.
- Sample Tokenizer uses HGAQ to efficiently quantize raw samples, capturing complete sample-level context.
- SIF-Mixer performs deep feature interaction over homogeneous sample representations via token and sample-level mixing.
- Resolves the heterogeneity between sequential and non-sequential features, improving model capacity; deployed on Meituan's food delivery platform.
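To make the Sample Tokenizer idea concrete, here is a minimal sketch of group-wise quantization in the spirit of HGAQ: a raw sample's feature vector is split into groups, and each group is mapped to the id of its nearest centroid in a per-group codebook. The function name, codebook layout, and distance metric are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def hgaq_tokenize(sample, codebooks):
    """Quantize one raw sample's feature vector into a tuple of code ids.

    Hypothetical sketch: `codebooks` is a list of (num_codes, group_dim)
    arrays, one per feature group; each group of the sample is assigned
    the id of its nearest centroid (Euclidean distance).
    """
    ids, start = [], 0
    for book in codebooks:
        group = sample[start:start + book.shape[1]]
        # nearest-centroid assignment within this feature group
        dists = np.linalg.norm(book - group, axis=1)
        ids.append(int(np.argmin(dists)))
        start += book.shape[1]
    return tuple(ids)

# toy usage: two feature groups (dims 2 and 3), 4 codes per group
rng = np.random.default_rng(0)
books = [rng.normal(size=(4, 2)), rng.normal(size=(4, 3))]
token = hgaq_tokenize(rng.normal(size=5), books)
```

The resulting tuple of discrete ids is what would stand in for the "Token Sample" placed into the behavior sequence; the hierarchical and adaptive aspects of HGAQ (how groups and codebook sizes are chosen) are beyond this sketch.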
Why it matters
Current recommender models struggle to fully exploit the context of each training sample and to handle heterogeneous feature types. SIF addresses this by encoding each raw historical sample into a sequence token, unifying feature representation and increasing usable model capacity. Its deployment on Meituan's food delivery platform demonstrates real-world impact.
Original Abstract
Scaling industrial recommender models has followed two parallel paradigms: **sample information scaling** -- enriching the information content of each training sample through deeper and longer behavior sequences -- and **model capacity scaling** -- unifying sequence modeling and feature interaction within a single Transformer backbone. However, these two paradigms still face two structural limitations. Firstly, sample information scaling methods encode only a subset of each historical interaction into the sequence token, leaving the majority of the original sample context unexploited and precluding the modeling of sample-level, time-varying features. Secondly, model capacity scaling methods are inherently constrained by the structural heterogeneity between sequential and non-sequential features, preventing the model from fully realizing its representational capacity. To address these issues, we propose **SIF** (*Sample Is Feature*), which encodes each historical Raw Sample directly into the sequence token -- maximally preserving sample information while simultaneously resolving the heterogeneity between sequential and non-sequential features. SIF consists of two key components. The **Sample Tokenizer** quantizes each historical Raw Sample into a Token Sample via hierarchical group-adaptive quantization (HGAQ), enabling full sample-level context to be incorporated into the sequence efficiently. The **SIF-Mixer** then performs deep feature interaction over the homogeneous sample representations via token-level and sample-level mixing, fully unleashing the model's representational capacity. Extensive experiments on a large-scale industrial dataset validate SIF's effectiveness, and we have successfully deployed SIF on the Meituan food delivery platform.
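The abstract's description of the SIF-Mixer suggests alternating interaction along two axes of a homogeneous sequence of sample embeddings. The sketch below shows one such block in the style of MLP-Mixer: token-level mixing interacts features within each sample, and sample-level mixing interacts across the behavior sequence. The weight shapes, residual form, and activation are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

def sif_mixer_block(x, w_token, w_sample):
    """One hypothetical mixing block over a sequence of sample embeddings.

    x: (num_samples, dim) -- one embedding per tokenized historical sample.
    w_token: (dim, dim) mixes features within each sample (token-level).
    w_sample: (num_samples, num_samples) mixes across the sequence
    (sample-level). Residual connections keep the shape unchanged.
    """
    # token-level mixing: per-sample feature interaction along `dim`
    x = x + np.tanh(x @ w_token)
    # sample-level mixing: cross-sample interaction along the sequence axis
    x = x + np.tanh((x.T @ w_sample).T)
    return x

# toy usage: a sequence of 6 sample embeddings of dimension 8
rng = np.random.default_rng(1)
seq = rng.normal(size=(6, 8))
out = sif_mixer_block(seq,
                      0.1 * rng.normal(size=(8, 8)),
                      0.1 * rng.normal(size=(6, 6)))
```

Because every position in the sequence is a full Token Sample rather than a mix of sequential and non-sequential features, both mixing steps operate over homogeneous representations, which is the structural property the abstract emphasizes.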