ArXiv TLDR

Multi-Aspect Knowledge Distillation for Language Model with Low-rank Factorization

arXiv: 2604.03110

Zihe Liu, Yulong Mao, Jinan Xu, Xinrui Peng, Kaiyu Huang

cs.CL

TLDR

MaKD compresses pre-trained language models by distilling knowledge from the teacher's self-attention and feed-forward modules, matching strong baselines at the same storage parameter budget.

Key contributions

  • Introduces Multi-aspect Knowledge Distillation (MaKD) for pre-trained language model compression.
  • Mimics the teacher's self-attention and feed-forward modules in depth, rather than aligning only layer-level knowledge distributions, to preserve fine-grained information (see the sketch after this list).
  • Captures rich, multi-aspect language knowledge during the distillation process.
  • Matches strong baselines under the same storage parameter budget and also distills auto-regressive models well.
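
The paper itself is only summarized here, so the following is a minimal sketch of what "mimicking the self-attention and feed-forward modules" could look like as distillation losses. The tensor names, the layer-mapping scheme, and the choice of KL/MSE objectives are assumptions for illustration, not the authors' exact formulation.

    # Hypothetical multi-aspect distillation loss (illustrative sketch, not MaKD's code).
    # Assumes teacher and student expose per-layer attention probabilities and
    # feed-forward hidden states with matching shapes for the mapped layers.
    import torch
    import torch.nn.functional as F

    def multi_aspect_kd_loss(student_outs, teacher_outs, layer_map):
        """student_outs / teacher_outs: dicts holding lists of per-layer tensors.
        layer_map: (student_layer, teacher_layer) pairs, since a compressed
        student typically has fewer layers than its teacher."""
        loss = 0.0
        for s_idx, t_idx in layer_map:
            # Aspect 1: mimic self-attention distributions (KL over attention probs).
            s_attn = student_outs["attn_probs"][s_idx]   # (batch, heads, seq, seq)
            t_attn = teacher_outs["attn_probs"][t_idx]
            loss = loss + F.kl_div(torch.log(s_attn + 1e-9), t_attn,
                                   reduction="batchmean")

            # Aspect 2: mimic feed-forward hidden states (plain MSE; a learned
            # projection would be needed if hidden sizes differ).
            s_ffn = student_outs["ffn_states"][s_idx]    # (batch, seq, hidden)
            t_ffn = teacher_outs["ffn_states"][t_idx]
            loss = loss + F.mse_loss(s_ffn, t_ffn)
        return loss

In practice such module-level terms would be added to the usual logit-distillation and task losses with tunable weights; the weighting and exact objectives used in MaKD may differ.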

Why it matters

Existing knowledge distillation methods align only the knowledge distribution across layers, which can discard fine-grained information during language model compression. MaKD addresses this by mimicking the self-attention and feed-forward modules in depth, capturing richer, multi-aspect knowledge. The result is more effective compression, which matters for deploying smaller, high-performing LMs.

Original Abstract

Knowledge distillation is an effective technique for pre-trained language model compression. However, existing methods only focus on the knowledge distribution among layers, which may cause the loss of fine-grained information in the alignment process. To address this issue, we introduce the Multi-aspect Knowledge Distillation (MaKD) method, which mimics the self-attention and feed-forward modules in greater depth to capture rich language knowledge information at different aspects. Experimental results demonstrate that MaKD can achieve competitive performance compared with various strong baselines with the same storage parameter budget. In addition, our method also performs well in distilling auto-regressive architecture models.
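
The title's low-rank factorization is not elaborated in this summary, but the standard recipe it points to is replacing a weight matrix W (out x in) with the product of two much smaller rank-r matrices, so the student fits a smaller storage parameter budget. The sketch below is an assumption-laden illustration: the truncated-SVD initialization, the rank, and the example layer sizes are hypothetical, not taken from the paper.

    # Hypothetical low-rank factorization of a linear layer (illustrative only;
    # the truncated-SVD initialization and rank choice are assumptions).
    import torch
    import torch.nn as nn

    def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
        """Replace one (out x in) linear layer with two low-rank layers,
        initialized from a truncated SVD of the original weight."""
        W = layer.weight.data                          # (out_features, in_features)
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        U_r = U[:, :rank] * S[:rank]                   # (out, rank)
        V_r = Vh[:rank, :]                             # (rank, in)

        down = nn.Linear(layer.in_features, rank, bias=False)
        up = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
        down.weight.data.copy_(V_r)
        up.weight.data.copy_(U_r)
        if layer.bias is not None:
            up.bias.data.copy_(layer.bias.data)
        return nn.Sequential(down, up)

    # Example budget: a 3072 -> 768 projection at rank 128 stores
    # 3072*128 + 128*768 = 491,520 weights instead of 3072*768 = 2,359,296.

This only illustrates where the storage savings come from; how and when the paper applies the factorization relative to distillation is not specified in this summary.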
