
Sustained Gradient Alignment Mediates Subliminal Learning in a Multi-Step Setting: Evidence from MNIST Auxiliary Logit Distillation Experiment

2604.25779

Chayanon Kitkana, Shivam Arora

cs.LG, cs.AI

TLDR

This paper shows that weak but sustained gradient alignment drives subliminal learning in multi-step distillation, and that liminal training, a mitigation method, attenuates this alignment without stopping trait acquisition.

Key contributions

  • Empirically demonstrates that gradient alignment consistently drives subliminal learning in multi-step distillation.
  • Shows this alignment, though weak, persists throughout training and causally contributes to trait acquisition.
  • Reveals that liminal training, a mitigation method, attenuates alignment but fails to stop trait acquisition.
  • Suggests current mitigation methods may be insufficient when first-order gradient drive dominates.

Why it matters

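The single-step theory of subliminal learning does not guarantee that gradient alignment persists over many updates. This work gives empirical evidence that it does persist in the MNIST auxiliary logit distillation setup, and that it causally contributes to trait acquisition. It also shows that a mitigation method can weaken the alignment yet still fail to stop trait transfer, which suggests that new approaches may be needed to prevent unintended trait transfer.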

Original Abstract

In the MNIST auxiliary logit distillation experiment, a student can acquire an unintended teacher trait despite distilling only on no-class logits through a phenomenon called subliminal learning. Under a single-step gradient descent assumption, subliminal learning theory attributes this effect to alignment between the trait and distillation gradients, but does not guarantee that this alignment persists in a multi-step setting. We empirically show that gradient alignment remains weakly but consistently positive throughout training and causally contributes to trait acquisition. We show that a mitigation method called liminal training works by attenuating the alignment and fails to stop trait acquisition in this setup. These results suggest that mitigation methods that operate in this regime may not reliably suppress trait acquisition when the first-order drive dominates.
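The alignment the abstract refers to can be illustrated with a minimal sketch. It assumes alignment is measured as the cosine similarity between the trait-loss gradient and the distillation-loss gradient, both flattened over the same parameters; the abstract does not state the paper's exact metric, and the function names `gradient_alignment` and `first_order_trait_change` are hypothetical. The second function is the standard first-order Taylor estimate of how one gradient step on the distillation loss changes the trait loss.

```python
import numpy as np


def gradient_alignment(g_trait, g_distill):
    """Cosine similarity between two gradients over the same parameters.

    Assumption: cosine similarity stands in for "alignment"; the paper's
    exact measurement is not given in the abstract.
    """
    a = np.ravel(g_trait).astype(float)
    b = np.ravel(g_distill).astype(float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Define alignment as 0 when either gradient vanishes.
    return 0.0 if denom == 0.0 else float(a @ b / denom)


def first_order_trait_change(g_trait, g_distill, lr):
    """First-order estimate of the trait-loss change after one step
    theta <- theta - lr * g_distill on the distillation loss.

    A negative value means the distillation step also lowers the trait loss.
    """
    a = np.ravel(g_trait).astype(float)
    b = np.ravel(g_distill).astype(float)
    return -lr * float(a @ b)


if __name__ == "__main__":
    # Toy gradients, chosen only for illustration (not data from the paper).
    g_trait = np.array([1.0, 0.0])
    g_distill = np.array([1.0, 1.0])
    print(gradient_alignment(g_trait, g_distill))
    print(first_order_trait_change(g_trait, g_distill, lr=0.1))
```

Under this reading, a positive inner product at every step means each distillation update also nudges the trait loss down to first order, so even a weak alignment can accumulate over many steps. This is the persistence the single-step theory does not guarantee.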
