Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows

April 22, 2026

arXiv:2604.20658

Shivani Kumar, Adarsh Bharathwaj, David Jurgens

cs.CL

TLDR

Cooperative profiles derived from behavioral economics games predict how well multi-agent LLM teams perform in AI for Science workflows, giving a fast, inexpensive diagnostic for screening models before deployment.

Key contributions

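Benchmarked 35 open-weight LLMs across six behavioral economics games to assess cooperation.
Cooperative profiles derived from these games predict LLM team performance in AI for Science tasks carried out under shared budget constraints.
Models that coordinate effectively and invest in multiplicative team production, rather than greedy strategies, produce better scientific reports on accuracy, quality, and completion.
Cooperative disposition is a distinct, measurable LLM property: the associations hold after controlling for multiple factors and are not reducible to general ability.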
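
To make the contrast between team production and greedy play concrete, here is a minimal sketch of a textbook linear public goods game. It is an illustration only: the summary does not name the six games used in the paper, and the function name `public_goods_payoffs`, the endowment of 10, and the multiplier of 2.0 are all assumptions chosen for this example.

```python
from typing import List


def public_goods_payoffs(
    contributions: List[float],
    endowment: float = 10.0,   # hypothetical per-agent budget
    multiplier: float = 2.0,   # hypothetical return on the shared pot
) -> List[float]:
    """Payoff per agent in a standard linear public goods game.

    Each agent keeps whatever it does not contribute. The pooled
    contributions are scaled by `multiplier` and split equally.
    """
    n = len(contributions)
    if n == 0:
        raise ValueError("at least one agent is required")
    for c in contributions:
        if not 0.0 <= c <= endowment:
            raise ValueError("each contribution must lie in [0, endowment]")

    # Equal share of the multiplied pot, paid to every agent.
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]


# Three scenarios for a four-agent team.
all_cooperate = public_goods_payoffs([10.0, 10.0, 10.0, 10.0])
all_greedy = public_goods_payoffs([0.0, 0.0, 0.0, 0.0])
one_free_rider = public_goods_payoffs([0.0, 10.0, 10.0, 10.0])

print("all cooperate: ", all_cooperate)
print("all greedy:    ", all_greedy)
print("one free rider:", one_free_rider)
```

With a multiplier between 1 and the team size, the team as a whole earns more when everyone contributes, yet any single agent earns more by withholding. A model's choices in a setting like this are the kind of signal a cooperative profile could capture.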

Why it matters

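The paper proposes behavioral games as a fast, inexpensive way to screen LLMs for cooperative fitness before costly multi-agent deployment. Because the reported associations hold after controlling for multiple factors, cooperative disposition appears to be a distinct, measurable property that is not reducible to general ability. That makes a game-derived profile a practical signal when selecting models for teams that carry out scientific tasks under shared budget constraints.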

Original Abstract

Multi-agent systems built from teams of large language models (LLMs) are increasingly deployed for collaborative scientific reasoning and problem-solving. These systems require agents to coordinate under shared constraints, such as GPUs or credit balances, where cooperative behavior matters. Behavioral economics provides a rich toolkit of games that isolate distinct cooperation mechanisms, yet it remains unknown whether a model's behavior in these stylized settings predicts its performance in realistic collaborative tasks. Here, we benchmark 35 open-weight LLMs across six behavioral economics games and show that game-derived cooperative profiles robustly predict downstream performance in AI-for-Science tasks, where teams of LLM agents collaboratively analyze data, build models, and produce scientific reports under shared budget constraints. Models that effectively coordinate games and invest in multiplicative team production (rather than greedy strategies) produce better scientific reports across three outcomes, accuracy, quality, and completion. These associations hold after controlling for multiple factors, indicating that cooperative disposition is a distinct, measurable property of LLMs not reducible to general ability. Our behavioral games framework thus offers a fast and inexpensive diagnostic for screening cooperative fitness before costly multi-agent deployment.
