Objective Shaping with Hard Negatives: Windowed Partial AUC Optimization for RL-based LLM Recommenders
Wentao Shi, Qifan Wang, Chen Chen, Fei Liu, Dongfang Liu, et al.
TLDR
This paper introduces Windowed Partial AUC (WPAUC) and the TAWin RL method to optimize LLM recommenders, improving Top-K performance by better handling hard negatives.
Key contributions
- Analyzes why beam-search negatives improve RL-based LLM recommenders, linking it to partial AUC optimization.
- Shows GRPO with binary rewards maximizes AUC, which is misaligned with Top-K recommendation metrics.
- Introduces Windowed Partial AUC (WPAUC) to directly align objective with Top-K metrics by constraining FPR.
- Proposes TAWin, an efficient RL method to optimize WPAUC, enabling explicit control over Top-K performance.
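The misalignment in the second bullet can be made concrete with a toy example (this is an illustration of the general AUC-vs-Top-K gap, not the paper's experimental setup): a few hard negatives scored above the positive barely dent full AUC, yet destroy Top-K hit rate.

```python
import numpy as np

def auc(pos, neg):
    # Full AUC: probability a random positive outranks a random negative
    pos = np.asarray(pos)[:, None]
    neg = np.asarray(neg)[None, :]
    return float((pos > neg).mean())

def hit_at_k(pos, neg, k):
    # 1.0 if any positive item appears in the Top-K of the merged ranking
    scores = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    order = np.argsort(-scores)
    return float(labels[order][:k].max())

pos = [0.7]                                   # one relevant item
neg = [0.8, 0.75] + list(np.linspace(0.0, 0.6, 18))  # 2 hard + 18 easy negatives

print(auc(pos, neg))        # high: the positive beats 18 of 20 negatives
print(hit_at_k(pos, neg, 2))  # zero: both Top-2 slots go to hard negatives
```

Optimizing full AUC rewards pushing the positive above the many easy negatives, while the two hard negatives occupying the Top-K slots dominate the recommendation metric — which is why reshaping the objective toward the low-FPR region matters.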
Why it matters
This paper clarifies why hard negatives improve RL-based LLM recommenders, proposing WPAUC and TAWin to directly optimize Top-K metrics. It offers a novel, theoretically grounded approach for building more accurate and controllable LLM-powered recommendation systems.
Original Abstract
Reinforcement learning (RL) effectively optimizes Large Language Model (LLM)-based recommenders by contrasting positive and negative items. Empirically, training with beam-search negatives consistently outperforms random negatives, yet the mechanism is not well understood. We address this gap by analyzing the induced optimization objective and show that: (i) Under binary reward feedback, optimizing LLM recommenders with Group Relative Policy Optimization (GRPO) is theoretically equivalent to maximizing the Area Under the ROC Curve (AUC), which is often misaligned with Top-$K$ recommendation; and (ii) Replacing random negatives with beam-search negatives reshapes the objective toward partial AUC, improving alignment with Top-$K$ metrics. Motivated by this perspective, we introduce Windowed Partial AUC (WPAUC), which constrains the false positive rate (FPR) to a window $[\alpha, \alpha+d]$ to more directly align with Top-$K$ metrics. We further propose an efficient Threshold-Adjusted Windowed reweighting (TAWin) RL method for its optimization, enabling explicit control over the targeted Top-$K$ performance. Experiments on four real-world datasets validate the theory and deliver consistent state-of-the-art performance.