Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards
Shuze Daniel Liu, Claire Chen, Jiabao Sean Xiao, Lei Lei, Yuheng Zhang + 2 more
TLDR
This paper uses Reinforcement Learning with Verifiable Rewards (RLVR) to teach LLMs to negotiate, enabling a 30B agent to outperform larger models.
Key contributions
- Introduces Reinforcement Learning with Verifiable Rewards (RLVR) to train LLMs for price negotiation.
- Trains a 30B buyer agent against an LLM seller, grounding rewards in economic surplus and budget adherence.
- Reveals a four-phase strategic evolution in negotiation, from naive bargaining to sophisticated persuasion.
- RLVR-trained 30B agent outperforms frontier LLMs and generalizes robustly to unseen, adversarial counterparties.
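The verifiable reward described above, grounded in economic surplus and strict budget adherence, can be sketched as a simple scoring function. This is a minimal illustration under assumed semantics (the function name, penalty value, and no-deal reward of zero are hypothetical, not taken from the paper):

```python
def buyer_reward(agreed_price: float, budget: float, deal_reached: bool,
                 penalty: float = -1.0) -> float:
    """Hypothetical verifiable reward for the buyer agent.

    Returns the economic surplus (budget - price) when a deal is struck
    within budget, a fixed penalty when the private budget constraint is
    violated, and zero when no deal is reached.
    """
    if not deal_reached:
        return 0.0
    if agreed_price > budget:  # strict budget adherence
        return penalty
    return budget - agreed_price  # economic surplus to maximize


# Example: a deal at 80 against a budget of 100 yields a surplus of 20.
print(buyer_reward(80.0, 100.0, True))
```

Because the reward depends only on the final price, the budget, and whether a deal closed, it is directly checkable from the transcript, which is what makes it "verifiable" in the RLVR sense.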
Why it matters
LLMs often struggle with strategic negotiation under incomplete information. This paper introduces an RLVR framework that effectively teaches LLMs to negotiate, yielding superior economic outcomes. It shows that smaller, specialized models can outperform much larger LLMs on complex strategic tasks, with direct implications for autonomous agent development.
Original Abstract
The recent advancement of Large Language Models (LLMs) has established their potential as autonomous interactive agents. However, they often struggle in strategic games of incomplete information, such as bilateral price negotiation. In this paper, we investigate if Reinforcement Learning from Verifiable Rewards (RLVR) can effectively teach LLMs to negotiate. Specifically, we explore the strategic behaviors that emerge during the learning process. We introduce a framework that trains a mid-sized buyer agent against a regulated LLM seller across a wide distribution of real-world products. By grounding reward signals directly in the maximization of economic surplus and strict adherence to private budget constraints, we reveal a novel four-phase strategic evolution. The agent progresses from naive bargaining to using aggressive starting prices, moves through a phase of deadlock, and ultimately develops sophisticated persuasive skills. Our results demonstrate that this verifiable training allows a 30B agent to significantly outperform frontier models over ten times its size in extracting surplus. Furthermore, the trained agent generalizes robustly to stronger counterparties unseen during training and remains effective even when facing hostile, adversarial seller personas.