MASCing: Configurable Mixture-of-Experts Behavior via Activation Steering Masks
Jona te Lintelo, Lichao Wu, Marina Krček, Sengim Karayalçin, Stjepan Picek
TLDR
MASCing enables flexible, retraining-free reconfiguration of MoE LLM behavior for safety by applying steering masks to the routing gates at inference time.
Key contributions
- Introduces MASCing, a novel framework for reconfiguring MoE LLM behavior without costly retraining.
- Utilizes an LSTM-based surrogate model to capture cross-layer routing dependencies and map routing logits to downstream behaviors.
- Optimizes a steering matrix that identifies behavior-relevant expert circuits; at inference time, its masks are applied to the routing gates to override expert selection (see the sketch after this list).
- Raises the average multi-turn jailbreak defense success rate from 52.5% to 83.9% and the average adult-content generation success rate from 52.6% to 82.0% across seven open-source MoE models.
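The masking step is easiest to picture in code. The sketch below is an illustration only, not the paper's implementation: it assumes a standard top-k MoE gate and models the steering mask as an additive bias on the router logits. The function name `masked_routing`, the tensor shapes, and the mask values are all assumptions.

```python
import torch


def masked_routing(router_logits: torch.Tensor,
                   steering_mask: torch.Tensor,
                   top_k: int = 2) -> tuple[torch.Tensor, torch.Tensor]:
    """Apply a steering mask to router logits before expert selection.

    router_logits: (batch, seq, num_experts) raw gate scores from the router.
    steering_mask: (num_experts,) additive bias; strongly negative entries
                   suppress experts, positive entries promote them
                   (illustrative formulation, not the paper's exact one).
    """
    steered = router_logits + steering_mask          # override router preferences
    weights, experts = torch.topk(steered, k=top_k, dim=-1)
    gates = torch.softmax(weights, dim=-1)           # renormalize over chosen experts
    return gates, experts


# Example: suppress expert 3 and promote expert 7 in an 8-expert layer.
logits = torch.randn(1, 4, 8)
mask = torch.zeros(8)
mask[3], mask[7] = -1e9, 5.0
gates, experts = masked_routing(logits, mask)
```

Because the mask only biases the gate, unmasked experts keep their learned weights, which is consistent with the paper's claim that general language utility is preserved.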
Why it matters
Sparse activation couples MoE LLM behavior to routing decisions, and adapting that behavior through fine-tuning or retraining is costly. MASCing offers a lightweight, flexible way to rapidly configure MoE behavior for diverse safety objectives without retraining, letting developers quickly adapt the same model to specific scenarios while enhancing both safety and utility.
Original Abstract
Mixture-of-Experts (MoE) architectures in Large Language Models (LLMs) have significantly reduced inference costs through sparse activation. However, this sparse activation paradigm also introduces new safety challenges. Since only a subset of experts is engaged for each input, model behavior becomes coupled to routing decisions, yielding a difficult-to-control mechanism that can vary across safety-relevant scenarios. At the same time, adapting model behavior through full fine-tuning or retraining is costly, especially when developers need to rapidly configure the same model for different safety objectives. We present MASCing (MoE Activation Steering Configuration), the first framework that enables flexible reconfiguration of MoE behavior across diverse safety scenarios without retraining. MASCing uses an LSTM-based surrogate model to capture cross-layer routing dependencies and map routing logits to downstream behaviors. It then optimizes a steering matrix to identify behavior-relevant expert circuits and, at inference time, applies steering masks to the routing gates to override expert selection. This enables targeted enhancement or suppression of specific behaviors while preserving general language utility. To demonstrate its reconfigurability, we apply MASCing to two different safety-related objectives and observe consistent gains with negligible overhead across seven open-source MoE models. For multi-turn jailbreak defense, it improves the average defense success rate from 52.5% to 83.9%, with gains of up to 89.2%. For adult-content generation, MASCing enables models to comply with such requests that would otherwise be refused, increasing the average generation success rate from 52.6% to 82.0%, with gains of up to 93.0%. These results establish MASCing as a practical, lightweight, and flexible framework for scenario-specific safety reconfiguration in MoE models.
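As a rough illustration of the surrogate step described in the abstract, the sketch below treats the model's MoE layers as an LSTM's time dimension, so the recurrent state can summarize cross-layer routing dependencies into a single behavior score. The class name, dimensions, output head, and pooling choice are assumptions; the paper does not specify these details here.

```python
import torch
import torch.nn as nn


class RoutingSurrogate(nn.Module):
    """Maps a sequence of per-layer routing logits to a behavior score.

    Runs the LSTM over the layer axis so its hidden state carries routing
    information across layers (an assumed, illustrative design).
    """

    def __init__(self, num_experts: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(num_experts, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)  # behavior logit, e.g. refuse vs. comply

    def forward(self, routing_logits: torch.Tensor) -> torch.Tensor:
        # routing_logits: (batch, num_layers, num_experts)
        _, (h_n, _) = self.lstm(routing_logits)
        return self.head(h_n[-1])  # score from the final hidden state


surrogate = RoutingSurrogate(num_experts=8)
score = surrogate(torch.randn(2, 24, 8))  # 2 prompts, 24 MoE layers, 8 experts
```

A differentiable surrogate of this kind would let the steering matrix be optimized with gradients against a behavior target, which matches the pipeline the abstract outlines.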