Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from MNIST Auxiliary Logit Distillation Experiment

April 28, 2026 · arXiv:2604.25779

cs.LG, cs.AI

TLDR

This paper shows that weak but sustained gradient alignment drives subliminal learning in multi-step distillation, and that liminal training attenuates this alignment without stopping trait acquisition.

Key contributions

Empirically demonstrates that gradient alignment consistently drives subliminal learning in multi-step distillation.
Shows this alignment, though weak, persists throughout training and causally contributes to trait acquisition.
Reveals that liminal training, a mitigation method, attenuates alignment but fails to stop trait acquisition.
Suggests current mitigation methods may be insufficient when first-order gradient drive dominates.

Why it matters

This work clarifies the mechanism behind subliminal learning in deep neural networks under multi-step training, where the single-step theory gives no guarantee that gradient alignment persists. It also shows that a current mitigation strategy attenuates the alignment without stopping trait acquisition, suggesting a need for new approaches to prevent unintended trait transfer.

Original Abstract

In the MNIST auxiliary logit distillation experiment, a student can acquire an unintended teacher trait despite distilling only on no-class logits through a phenomenon called subliminal learning. Under a single-step gradient descent assumption, subliminal learning theory attributes this effect to alignment between the trait and distillation gradients, but does not guarantee that this alignment persists in a multi-step setting. We empirically show that gradient alignment remains weakly but consistently positive throughout training and causally contributes to trait acquisition. We show that a mitigation method called liminal training works by attenuating the alignment and fails to stop trait acquisition in this setup. These results suggest that mitigation methods that operate in this regime may not reliably suppress trait acquisition when the first-order drive dominates.
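To make the notion of gradient alignment concrete, here is a minimal sketch in plain Python. It is not the paper's code: the function names and the toy gradient vectors are hypothetical, and the alignment measure is assumed to be cosine similarity between the flattened trait and distillation gradients. The second function uses the standard first-order Taylor estimate: one gradient step of size `lr` on the distillation loss changes the trait loss by roughly `-lr * <g_trait, g_distill>`, so a positive inner product lowers the trait loss.

```python
import math


def dot(u, v):
    """Inner product of two equal-length flattened gradient vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_alignment(g_trait, g_distill):
    """Cosine similarity between trait and distillation gradients.

    Returns 0.0 if either gradient is the zero vector.
    """
    norm = math.sqrt(dot(g_trait, g_trait)) * math.sqrt(dot(g_distill, g_distill))
    return dot(g_trait, g_distill) / norm if norm > 0 else 0.0


def first_order_trait_change(g_trait, g_distill, lr):
    """First-order estimate of the trait-loss change after one gradient
    step on the distillation loss: -lr * <g_trait, g_distill>."""
    return -lr * dot(g_trait, g_distill)


# Hypothetical toy gradients (illustrative values, not from the paper).
g_trait = [1.0, 0.0, 2.0]
g_distill = [0.5, 3.0, 0.1]

alignment = cosine_alignment(g_trait, g_distill)
delta = first_order_trait_change(g_trait, g_distill, lr=0.1)

print(f"alignment = {alignment:.3f}")
print(f"estimated trait-loss change = {delta:.3f}")
```

In this toy case the alignment is weakly positive, so the estimated trait-loss change is negative. If that sign holds step after step, the small decreases accumulate over training, which is the multi-step regime the abstract describes.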
