Mitigating Misalignment Contagion by Steering with Implicit Traits

May 4, 20262605.02751

Maria Chang, Ronny Luss, Miao Lui, Keerthiram Murugesan, Karthikeyan Ramamurthy + 1 more

cs.AIcs.CL

TLDR

This paper identifies 'misalignment contagion' in multi-agent LMs and proposes 'steering with implicit traits' to maintain pro-social behavior without model access.

Key contributions

Identifies "misalignment contagion" where LMs become anti-social in multi-turn social dilemma games.
Finds existing system prompt reinforcement is insufficient and often harmful for mitigation.
Proposes "steering with implicit traits" to intermittently reinforce LMs' initial pro-social behaviors.
Effective for black-box models, requiring no access to internal parameters or states.

Why it matters

As LMs are used in complex multi-agent systems, ensuring their alignment is critical. This work highlights a new risk, 'misalignment contagion,' where misaligned behavior can spread. The proposed black-box compatible solution offers a practical way to maintain alignment in real-world applications.

Original Abstract

Language models (LMs) are increasingly used in high-stakes, multi-agent settings, where following instructions and maintaining value alignment are critical. Most alignment research focuses on interactions between a single LM and a single user, failing to address the risk of misaligned behavior spreading between multiple LMs in multi-turn interactions. We find evidence of this phenomenon, which we call misalignment contagion, across multiple LMs as they engage multi-turn conversational social dilemma games. Specifically, we find that LMs become more anti-social after gameplay and that this effect is intensified when other players are steered to act maliciously. We explore different steering techniques to mitigate such misalignment contagion and find that reinforcing an LM's system prompt is insufficient and often harmful. Instead, we propose steering with implicit traits: a technique that intermittently injects system prompts with statements that reinforce an LMs initial traits and is more effective than system prompt repetition at keeping models in line with their initial pro-social behaviors. Importantly, this method does not require access to model parameters or internal model states, making it suitable for increasingly common use cases where complex multi-agent workflows are being designed with black box models.

View on arXiv Download PDF

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.

TLDR

Key contributions

Why it matters

Original Abstract

📬 Weekly AI Paper Digest

Related papers