ArXiv TLDR

Understanding the Mechanism of Altruism in Large Language Models

2604.19260

Shuhuai Zhang, Shu Wang, Zijun Yao, Chuanhao Li, Xiaozhi Wang + 2 more

econ.GN

TLDR

Researchers uncover the internal mechanisms of altruism in LLMs using sparse autoencoders (SAEs), identifying and causally manipulating features linked to prosocial behavior.

Key contributions

  • Identified a small set of SAE features (0.024% of all features across the model's layers) that separate altruistic from selfish behavior in a Dictator Game.
  • Classified identified features using dual-process theories (System 1/heuristic vs. System 2/deliberative).
  • Causally validated feature roles via activation patching and steering; System 2 features exert a more proximal influence on the model's final output than System 1 features.
  • Showed that steering these altruism-linked features generalizes across multiple social-preference games.
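The steering intervention described above can be sketched in a few lines: add a scaled SAE decoder direction to a residual-stream activation so the corresponding feature is amplified (or suppressed). This is a minimal illustrative sketch, not the paper's implementation; the function name, the 768-dimensional toy hidden state, and the steering coefficient are assumptions.

```python
import numpy as np

def steer_hidden_state(h, feature_direction, alpha):
    """Shift a hidden state along one SAE feature's decoder direction.

    h                 : residual-stream vector at some layer, shape (d_model,)
    feature_direction : decoder column for the target feature (any norm)
    alpha             : steering coefficient; positive values amplify
                        the feature, negative values suppress it
    """
    d = feature_direction / np.linalg.norm(feature_direction)  # unit vector
    return h + alpha * d

# Toy usage with a hypothetical "altruism" direction.
rng = np.random.default_rng(0)
h = rng.normal(size=768)
direction = rng.normal(size=768)
h_steered = steer_hidden_state(h, direction, alpha=4.0)
```

In practice the same vector is added at every token position of the chosen layer during generation, and sweeping `alpha` continuously shifts the allocation distribution, which is what the paper's "continuous steering" refers to.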

Why it matters

This research provides a novel mechanistic account of prosocial behavior in LLMs, translating abstract altruism into identifiable network states. It offers a framework for developing more transparent and value-aligned AI systems, and by enabling targeted manipulation of specific behavioral traits, it points toward safer and more ethically deployed LLMs.

Original Abstract

Altruism is fundamental to human societies, fostering cooperation and social cohesion. Recent studies suggest that large language models (LLMs) can display human-like prosocial behavior, but the internal computations that produce such behavior remain poorly understood. We investigate the mechanisms underlying LLM altruism using sparse autoencoders (SAEs). In a standard Dictator Game, minimal-pair prompts that differ only in social stance (generous versus selfish) induce large, economically meaningful shifts in allocations. Leveraging this contrast, we identify a set of SAE features (0.024% of all features across the model's layers) whose activations are strongly associated with the behavioral shift. To interpret these features, we use benchmark tasks motivated by dual-process theories to classify a subset as primarily heuristic (System 1) or primarily deliberative (System 2). Causal interventions validate their functional role: activation patching and continuous steering of this feature direction reliably shift allocation distributions, with System 2 features exerting a more proximal influence on the model's final output than System 1 features. The same steering direction generalizes across multiple social-preference games. Together, these results enhance our understanding of artificial cognition by translating altruistic behaviors into identifiable network states and provide a framework for aligning LLM behavior with human values, thereby informing more transparent and value-aligned deployment.
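The feature-identification step in the abstract (contrasting minimal-pair generous vs. selfish prompts and keeping the ~0.024% of features with the largest activation shift) can be sketched as follows. The function name, array shapes, and threshold are illustrative assumptions, not the authors' code.

```python
import numpy as np

def contrast_features(acts_generous, acts_selfish, top_frac=0.00024):
    """Rank SAE features by mean activation gap between prompt stances.

    acts_generous, acts_selfish : arrays of shape (n_prompts, n_features)
                                  holding SAE feature activations
    top_frac                    : fraction of features to keep
                                  (0.00024 mirrors the paper's 0.024%)
    Returns the indices of the top-|gap| features.
    """
    gap = acts_generous.mean(axis=0) - acts_selfish.mean(axis=0)
    k = max(1, int(round(top_frac * gap.size)))
    return np.argsort(-np.abs(gap))[:k]

# Toy check: plant a strong generous-vs-selfish gap in feature 5.
rng = np.random.default_rng(1)
gen = rng.normal(size=(100, 10_000))
sel = rng.normal(size=(100, 10_000))
gen[:, 5] += 10.0
selected = contrast_features(gen, sel, top_frac=0.0005)
```

The selected indices would then feed the dual-process classification and the causal interventions (patching and steering) described in the abstract.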

📬 Weekly AI Paper Digest

Get the top 10 AI/ML arXiv papers from the week — summarized, scored, and delivered to your inbox every Monday.