ArXiv TLDR

Valence-Arousal Subspace in LLMs: Circular Emotion Geometry and Multi-Behavioral Control

2604.03147

Lihao Sun, Lewen Yan, Xiaoya Lu, Andrew Lee, Jie Zhang + 1 more

cs.CL, cs.AI, cs.CY

TLDR

Researchers identify a valence-arousal subspace in LLM representations; steering along its axes shifts the emotional tone of outputs and gives near-monotonic control over refusal and sycophancy across Llama-3.1-8B, Qwen3-8B, and Qwen3-14B.

Key contributions

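  • The method derives emotion steering vectors from 211k emotion-labeled texts, then learns valence-arousal (VA) axes as linear combinations of their top PCA components via ridge regression on the model's self-reported VA scores.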
  • The VA subspace exhibits circular geometry, aligning with established human emotion perception models.
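  • Steering along the VA axes shifts the affect of model outputs monotonically and gives near-monotonic, bidirectional control over refusal and sycophancy (increasing arousal decreases refusal and increases sycophancy); the effects replicate on Llama-3.1-8B, Qwen3-8B, and Qwen3-14B.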
  • Refusal-associated tokens ("I can't," "sorry") occupy low-arousal, negative-valence regions, so VA steering directly changes their emission probability, which accounts for its behavioral effects.
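
The axis-fitting step in the first contribution can be sketched roughly as follows. This is a minimal NumPy illustration on synthetic data, not the authors' implementation: the function names, the number of PCA components, the ridge penalty, and the toy "emotions on a circle" data are all assumptions made for the example.

```python
import numpy as np

def fit_va_axes(steering_vectors, va_scores, n_components=4, ridge_lambda=1.0):
    """Fit valence and arousal directions from emotion steering vectors.

    steering_vectors: (n_emotions, d_model), one vector per emotion.
    va_scores: (n_emotions, 2), columns are (valence, arousal).
    n_components and ridge_lambda are illustrative defaults, not paper values.
    """
    X = steering_vectors - steering_vectors.mean(axis=0, keepdims=True)
    Y = va_scores - va_scores.mean(axis=0, keepdims=True)
    # PCA via SVD: rows of Vt are principal directions in model space.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    components = Vt[:n_components]                      # (k, d_model)
    Z = X @ components.T                                # PCA coordinates, (n, k)
    # Closed-form ridge regression from PCA coordinates to VA scores.
    gram = Z.T @ Z + ridge_lambda * np.eye(n_components)
    W = np.linalg.solve(gram, Z.T @ Y)                  # (k, 2)
    # Each axis is a linear combination of the top PCA components.
    axes = W.T @ components                             # (2, d_model)
    return axes / np.linalg.norm(axes, axis=1, keepdims=True)

def synthetic_demo(seed=0, n_emotions=24, d_model=32):
    """Toy data: emotions placed on a circle in a hidden 2-D plane plus noise."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(d_model, 2)))
    u, w = Q[:, 0], Q[:, 1]                             # hidden true VA directions
    theta = np.linspace(0.0, 2.0 * np.pi, n_emotions, endpoint=False)
    va = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    vectors = np.outer(va[:, 0], u) + np.outer(va[:, 1], w)
    vectors += 0.01 * rng.normal(size=vectors.shape)
    return fit_va_axes(vectors, va), u, w

axes, u, w = synthetic_demo()
print("cosine(valence axis, true direction):", float(axes[0] @ u))
print("cosine(arousal axis, true direction):", float(axes[1] @ w))
```

On this toy data the recovered axes should line up closely with the hidden directions. In a real setting the rows of `steering_vectors` would come from model activations and `va_scores` from the model's self-reported ratings, neither of which this sketch reproduces.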

Why it matters

The paper gives a method for fine-grained control of emotional tone and related behaviors in LLMs, together with a mechanistic account of why emotional framing affects refusal and sycophancy. This could inform work on more controllable model behavior, with direct relevance to safety and alignment.

Original Abstract

We present a method to identify a valence-arousal (VA) subspace within large language model representations. From 211k emotion-labeled texts, we derive emotion steering vectors, then learn VA axes as linear combinations of their top PCA components via ridge regression on the model's self-reported valence-arousal scores. The resulting VA subspace exhibits circular geometry consistent with established models of human emotion perception. Projections along our recovered VA subspace correlate with human-crowdsourced VA ratings across 44k lexical items. Furthermore, steering generation along these axes produces monotonic shifts in the corresponding affective dimensions of model outputs. Steering along these directions also induces near-monotonic bidirectional control over refusal and sycophancy: increasing arousal decreases refusal and increases sycophancy, and vice versa. These effects replicate across Llama-3.1-8B, Qwen3-8B, and Qwen3-14B, demonstrating cross-architecture generality. We provide a mechanistic account for these effects and prior emotionally-framed controls: refusal-associated tokens ("I can't," "sorry") occupy low-arousal, negative-valence regions, so VA steering directly modulates their emission probability.
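
The mechanistic account in the last sentence of the abstract can be illustrated with a toy calculation. The sketch below assumes that steering adds a scaled unit direction to a hidden state and that a hypothetical refusal token's unembedding row points against the arousal axis. The dimensions, the vocabulary, and the names `steer` and `token_prob` are invented for illustration, and the abstract does not specify the layer or scale used in the paper.

```python
import numpy as np

def steer(hidden, axis, alpha):
    """Add alpha times the unit-normalized direction to a hidden state."""
    unit = axis / np.linalg.norm(axis)
    return hidden + alpha * unit

def token_prob(hidden, unembed, token_id):
    """Softmax probability of one token given a hidden state."""
    logits = unembed @ hidden
    logits = logits - logits.max()          # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs[token_id])

rng = np.random.default_rng(0)
d_model, vocab_size, refusal_id = 16, 5, 0

# Toy arousal axis: the first coordinate of the hidden space (assumption).
arousal_axis = np.zeros(d_model)
arousal_axis[0] = 1.0

# Toy unembedding matrix: only the refusal token has a component along the
# arousal axis, and it is negative (a "low-arousal" token, by construction).
unembed = rng.normal(size=(vocab_size, d_model))
unembed[:, 0] = 0.0
unembed[refusal_id, 0] = -2.0

hidden = rng.normal(size=d_model)
alphas = [-2.0, -1.0, 0.0, 1.0, 2.0]
refusal_probs = [
    token_prob(steer(hidden, arousal_axis, a), unembed, refusal_id)
    for a in alphas
]
for a, p in zip(alphas, refusal_probs):
    print(f"alpha={a:+.1f}  P(refusal token)={p:.4f}")
```

Because only the refusal token's logit depends on the steering strength here, its probability falls monotonically as arousal increases. Real models have many interacting tokens and layers, which fits with the abstract describing the behavioral control as near-monotonic rather than exact.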
