Steer Like the LLM: Activation Steering that Mimics Prompting

May 5, 2026, arXiv:2605.03907

cs.CL, cs.AI, cs.LG

TLDR

This paper introduces Prompt Steering Replacement (PSR) models that mimic prompt-based LLM steering by applying token-specific activation interventions.

Key contributions

Formulates prompt steering as a form of activation steering, with the aim of closing the performance gap between the two.
Reveals that prompt steering applies strong, token-specific interventions, unlike current activation methods.
Introduces Prompt Steering Replacement (PSR) models that learn token-specific steering coefficients.
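PSR models outperform existing activation steering methods on three steering benchmarks, especially when controlling for high-coherence completions, and compare favorably to prompting on AxBench and persona steering.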
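
To make the contrast between a constant intervention and a token-specific one concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the function names, the linear-plus-ReLU coefficient predictor (`w`, `b`), and the random data are all assumptions made for illustration, and the training step that imitates prompt-based interventions is not shown.

```python
import numpy as np

def constant_steering(acts, direction, coeff):
    """Add the same multiple of `direction` to every token's activation."""
    return acts + coeff * direction

def token_specific_steering(acts, direction, w, b):
    """Add a per-token multiple of `direction`.

    Each coefficient is predicted from that token's own activation.
    The linear-plus-ReLU predictor (w, b) is a hypothetical choice,
    not the architecture used in the paper.
    """
    coeffs = np.maximum(acts @ w + b, 0.0)  # shape: (num_tokens,)
    return acts + coeffs[:, None] * direction, coeffs

# Toy data: 5 tokens with 8-dimensional activations (random, illustrative only).
rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 8))
direction = rng.normal(size=8)
direction /= np.linalg.norm(direction)  # unit-norm steering direction
w = rng.normal(size=8)                  # hypothetical predictor weights
b = -0.5                                # hypothetical predictor bias

steered_const = constant_steering(acts, direction, 2.0)
steered_tok, coeffs = token_specific_steering(acts, direction, w, b)
```

In the constant variant every token is shifted by the same amount. In the token-specific variant the ReLU lets some tokens receive a zero coefficient, and so no intervention at all, which mirrors the observation that prompt steering acts strongly on some tokens while barely affecting others.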

Why it matters

This work targets a key limitation in LLM control: activation steering methods often underperform prompt steering. By distilling prompt behavior into simpler models, it offers a more interpretable and potentially more efficient way to guide LLMs without relying solely on prompt engineering. This could lead to more robust and fine-grained control over model behavior.

Original Abstract

Large language models can be steered at inference time through prompting or activation interventions, but activation steering methods often underperform compared to prompt-based approaches. We propose a framework that formulates prompt steering as a form of activation steering and investigates whether distilling successful prompt steering behavior into simpler, interpretable models can close this gap. Our analysis reveals that popular activation steering methods are not faithful to the mechanics of prompt steering, which applies strong interventions on some tokens while barely affecting others. Based on these insights, we introduce Prompt Steering Replacement (PSR) models that estimate token-specific steering coefficients from the activations themselves and are trained to imitate prompt-based interventions. Experiments on three steering benchmarks across multiple language models show that PSR models outperform existing activation steering methods, especially when controlling for high-coherence completions, and also compare favorably to prompting on AxBench and persona steering.
