From Weights to Activations: Is Steering the Next Frontier of Adaptation?
Simon Ostermann, Daniil Gurgurov, Tanja Baeumel, Michael A. Hedderich, Sebastian Lapuschkin + 2 more
TLDR
This paper argues that steering, which modifies internal activations at inference time, should be treated as a distinct form of language model adaptation.
Key contributions
- Argues that steering should be regarded as a distinct model adaptation paradigm.
- Proposes functional criteria to compare steering with traditional adaptation techniques.
- Highlights steering's unique benefits: local, reversible, activation-space interventions.
- Advocates for a unified taxonomy of model adaptation methods.
Why it matters
This paper offers a conceptual framework for treating steering as a model adaptation paradigm in its own right. By clarifying how steering relates to fine-tuning, parameter-efficient adaptation, and prompting, it motivates a unified taxonomy of adaptation methods for large language models.
Original Abstract
Post-training adaptation of language models is commonly achieved through parameter updates or input-based methods such as fine-tuning, parameter-efficient adaptation, and prompting. In parallel, a growing body of work modifies internal activations at inference time to influence model behavior, an approach known as steering. Despite increasing use, steering is rarely analyzed within the same conceptual framework as established adaptation methods. In this work, we argue that steering should be regarded as a form of model adaptation. We introduce a set of functional criteria for adaptation methods and use them to compare steering approaches with classical alternatives. This analysis positions steering as a distinct adaptation paradigm based on targeted interventions in activation space, enabling local and reversible behavioral change without parameter updates. The resulting framing clarifies how steering relates to existing methods, motivating a unified taxonomy for model adaptation.
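To make "targeted interventions in activation space" concrete, here is a minimal sketch of activation steering on a toy two-layer network. It is not the authors' method. The weights `W_IN` and `W_OUT`, the function `forward`, the steering vector, and the scale `alpha` are all hypothetical and chosen only for illustration. The sketch shows that adding a vector to a hidden activation changes the output while the weights stay untouched, and that dropping the vector restores the original behavior.

```python
from typing import List, Optional

Vector = List[float]

# Toy fixed weights (hypothetical values, for illustration only).
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 hidden units x 2 inputs
W_OUT = [[1.0, -1.0, 0.5]]                    # 1 output x 3 hidden units


def matvec(matrix: List[Vector], vec: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def forward(x: Vector, steering: Optional[Vector] = None, alpha: float = 1.0) -> Vector:
    """Run the toy network, optionally adding alpha * steering to the hidden layer."""
    hidden = [max(0.0, h) for h in matvec(W_IN, x)]  # ReLU activation
    if steering is not None:
        # The intervention: shift the activation, leave the weights alone.
        hidden = [h + alpha * s for h, s in zip(hidden, steering)]
    return matvec(W_OUT, hidden)


x = [1.0, 2.0]
baseline = forward(x)                             # no intervention
steered = forward(x, steering=[0.0, 0.0, 2.0])    # shift the third hidden unit
restored = forward(x)                             # intervention removed

print(baseline, steered, restored)
```

In this toy setting the steered output differs from the baseline, and the third call matches the first because no parameter was ever updated. In practice, steering vectors are typically applied to the hidden states of a specific layer of a real language model rather than to a hand-written network like this one.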