ArXiv TLDR

RecRM-Bench: Benchmarking Multidimensional Reward Modeling for Agentic Recommender Systems

arXiv: 2605.11874

Wenwen Zeng, Jinhui Zhang, Hao Chen, Zhaoyu Hu, Yongqi Liang + 8 more

cs.IR

TLDR

RecRM-Bench introduces a comprehensive benchmark for multi-dimensional reward modeling in LLM-agent recommender systems, addressing the limitations of today's single-dimensional, outcome-only rewards.

Key contributions

  • Introduces RecRM-Bench, the largest and most comprehensive benchmark for agentic recommender systems.
  • Comprises over 1M structured entries across 4 dimensions: instruction following, factual consistency, query-item relevance, and fine-grained user behavior prediction.
  • Enables comprehensive assessment from syntactic compliance to complex intent grounding and preference modeling.
  • Proposes a systematic framework for building multi-dimensional reward models and hybrid reward functions.

Why it matters

Current LLM-agent recommender systems rely on simplistic, single-dimensional rewards, hindering their ability to understand complex user intents and follow instructions. RecRM-Bench provides the necessary tools to develop sophisticated multi-dimensional reward models, crucial for building more reliable and capable interactive recommendation agents.
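To make the "hybrid reward function" idea concrete, here is a minimal sketch of how per-dimension reward-model scores might be combined into a single scalar reward for RL training. The four dimension names come from the benchmark; the weighted-sum combination, the `hybrid_reward` function, and the weights themselves are illustrative assumptions, not the paper's actual framework.

```python
# Hypothetical sketch: combine per-dimension reward-model scores
# (each assumed to lie in [0, 1]) into one scalar reward.
# Dimension names follow RecRM-Bench; weights are placeholders.

DIMS = [
    "instruction_following",
    "factual_consistency",
    "relevance",
    "user_behavior",
]

def hybrid_reward(scores: dict, weights: dict = None) -> float:
    """Weighted average of per-dimension scores."""
    if weights is None:
        weights = {d: 1.0 for d in DIMS}  # uniform weighting by default
    total = sum(weights[d] for d in DIMS)
    return sum(weights[d] * scores[d] for d in DIMS) / total

# Example: a response that follows instructions well but is only
# moderately relevant to the query.
r = hybrid_reward({
    "instruction_following": 0.9,
    "factual_consistency": 0.8,
    "relevance": 0.6,
    "user_behavior": 0.5,
})
```

In practice the weights would be tuned (or learned) to balance syntactic compliance against preference-modeling accuracy, and each score would come from a separately trained reward model for that dimension.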

Original Abstract

The integration of Large Language Model (LLM) agents is transforming recommender systems from simple query-item matching towards deeply personalized and interactive recommendations. Reinforcement Learning (RL) provides an essential framework for the optimization of these agents in recommendation tasks. However, current methodologies remain limited by a reliance on single-dimensional outcome-based rewards that focus exclusively on final user interactions, overlooking critical intermediate capabilities such as instruction following and complex intent understanding. Despite the necessity of designing multi-dimensional rewards, the field lacks a standardized benchmark to facilitate this development. To bridge this gap, we introduce RecRM-Bench, the largest and most comprehensive benchmark to date for agentic recommender systems. It comprises over 1 million structured entries across four core evaluation dimensions: instruction following, factual consistency, query-item relevance, and fine-grained user behavior prediction. By supporting comprehensive assessment from syntactic compliance to complex intent grounding and preference modeling, RecRM-Bench provides a foundational dataset for training sophisticated reward models. Furthermore, we propose a systematic framework for the construction of multi-dimensional reward models and the integration of a hybrid reward function, establishing a robust foundation for developing reliable and highly capable agentic recommender systems. The complete RecRM-Bench dataset is publicly available at https://huggingface.co/datasets/wwzeng/RecRM-Bench.
