Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
Julian Skifstad, Xinyue Annie Yang, Glen Chou
TLDR
This paper introduces a novel LQR-based activation steering method for LLMs, leveraging their local linearity to achieve robust, closed-loop behavior control.
Key contributions
- Empirically shows LLM layer-wise dynamics are well-approximated by locally-linear models (see the toy linearization check after this list).
- Models LLM inference as a linear time-varying system, adapting LQR for closed-loop control.
- Achieves state-of-the-art modulation of toxicity, truthfulness, refusal, and arbitrary concepts.
- Provides theoretical bounds on setpoint tracking error for formal performance guarantees.
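A quick way to see what the local-linearity claim means in practice is to linearize a block around an activation and compare the exact output on a perturbed input with the first-order Jacobian prediction. The sketch below is illustrative only: a small residual MLP stands in for a transformer block, and all names and shapes are assumptions rather than the authors' code.

```python
# Toy check of the local-linearity claim (illustrative; a residual MLP
# stands in for a transformer block, and all names/shapes are assumptions).
import torch

torch.manual_seed(0)
d = 64
mlp = torch.nn.Sequential(
    torch.nn.Linear(d, 4 * d), torch.nn.GELU(), torch.nn.Linear(4 * d, d)
)

def block(x):
    # Residual layer-wise dynamics: x_{t+1} = x_t + f_t(x_t)
    return x + mlp(x)

x = torch.randn(d)                                  # activation at some layer/token
A_t = torch.autograd.functional.jacobian(block, x)  # (d, d) local linear model

delta = 1e-2 * torch.randn(d)                       # small perturbation (e.g., a steering input)
exact = block(x + delta)
linear = block(x) + A_t @ delta

rel_err = ((exact - linear).norm() / (exact - block(x)).norm()).item()
print(f"relative linearization error: {rel_err:.2e}")
```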
Why it matters
Existing activation steering methods are open-loop: they ignore how perturbations propagate through layers and provide no error feedback, which makes them suboptimal. By exploiting the local linearity of LLMs, this work closes the loop with lightweight feedback control, enabling robust, fine-grained behavior control that surpasses prior steering methods across models, scales, and tasks.
Original Abstract
Inference-time LLM alignment methods, particularly activation steering, offer an alternative to fine-tuning by directly modifying activations during generation. Existing methods, however, often rely on non-anticipative interventions that ignore how perturbations propagate through transformer layers and lack online error feedback, resulting in suboptimal, open-loop control. To address this, we show empirically that, despite the nonlinear structure of transformer blocks, layer-wise dynamics across multiple LLM architectures and scales are well-approximated by locally-linear models. Exploiting this property, we model LLM inference as a linear time-varying dynamical system and adapt the classical linear quadratic regulator to compute feedback controllers using layer-wise Jacobians, steering activations toward desired semantic setpoints in closed-loop with minimal computational overhead and no offline training. We also derive theoretical bounds on setpoint tracking error, enabling formal guarantees on steering performance. Using a novel adaptive semantic feature setpoint signal, our method yields robust, fine-grained behavior control across models, scales, and tasks, including state-of-the-art modulation of toxicity, truthfulness, refusal, and arbitrary concepts, surpassing baseline steering methods. Our code is available at: https://github.com/trustworthyrobotics/lqr-activation-steering
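For intuition on the control step described in the abstract, here is a minimal finite-horizon, time-varying LQR sketch: given layer-wise Jacobians A_t (and taking B_t = I, i.e., the control is added directly to the activations), a backward Riccati recursion yields per-layer feedback gains that regulate the deviation from a setpoint in closed loop. Matrix names, cost weights, and the simplified error dynamics (the reference feedforward term is dropped) are assumptions for illustration, not the paper's exact formulation; see the linked repository for the authors' implementation.

```python
# Minimal LTV-LQR sketch (illustrative; random matrices stand in for layer
# Jacobians, and the paper's setpoint/feedforward handling is omitted).
import numpy as np

def lqr_gains(A_list, B_list, Q, R, Qf):
    """Backward Riccati recursion; returns per-layer feedback gains K_t."""
    P, gains = Qf, []
    for A, B in zip(reversed(A_list), reversed(B_list)):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]

def rollout(e0, A_list, B_list, gains):
    """Closed-loop rollout of the error e_t = x_t - x_ref under u_t = -K_t e_t."""
    e = e0
    for A, B, K in zip(A_list, B_list, gains):
        u = -K @ e            # feedback correction added to the activations
        e = A @ e + B @ u     # propagate through the linearized layer
    return e

d, T = 16, 8                  # toy hidden size and layer count
rng = np.random.default_rng(0)
A_list = [np.eye(d) + 0.05 * rng.standard_normal((d, d)) for _ in range(T)]
B_list = [np.eye(d)] * T
Q, R, Qf = np.eye(d), np.eye(d), 100.0 * np.eye(d)

e0 = rng.standard_normal(d)   # initial deviation from the semantic setpoint
eT = rollout(e0, A_list, B_list, lqr_gains(A_list, B_list, Q, R, Qf))
print(np.linalg.norm(e0), "->", np.linalg.norm(eT))
```

In the paper's setting, the A_t would come from Jacobians of the actual transformer layers (as in the toy linearization check above), and the regulated quantity is the activation's deviation from an adaptive semantic feature setpoint.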