Reliability-Gated Multi-Teacher Distillation for Low-Resource Abstractive Summarization
Dipto Sumit, Ankan Kumar Roy, Sadia Khair Rodela, Atia Haque Asha, Mourchona Afrin + 2 more
TLDR
This paper introduces reliability-aware multi-teacher distillation methods (EWAD, CPDP) for low-resource summarization and characterizes when multi-teacher supervision actually improves performance.
Key contributions
- Introduces EWAD, a token-level mechanism for reliability-aware multi-teacher distillation (an illustrative sketch follows this list).
- Proposes CPDP, a geometric constraint on the student's position relative to heterogeneous teachers.
- Shows that plain logit-level KD gives the most reliable gains, while more complex distillation improves semantic similarity on short summaries but degrades longer outputs.
- Demonstrates that cross-lingual pseudo-label KD retains 71-122% of teacher ROUGE-L at 3.2x compression across ten languages.
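The paper's code is not reproduced here, but the description of EWAD above is concrete enough to sketch. Below is a minimal, hypothetical PyTorch illustration of token-level reliability gating under stated assumptions: two teachers, agreement measured by a symmetric KL between their distributions, and teacher entropy used to discount uncertain tokens. Every name and design choice (`ewad_loss`, the exponential squashing, the averaged-teacher target) is an assumption for illustration, not the authors' implementation.

```python
import math
import torch
import torch.nn.functional as F

def ewad_loss(student_logits, teacher_logits_a, teacher_logits_b, gold_ids, tau=2.0):
    """Hypothetical EWAD-style gating; NOT the authors' implementation.

    Per token, routes supervision between teacher distillation and gold
    cross-entropy using inter-teacher agreement, discounted where the
    teachers are uncertain. Logits: (batch, seq, vocab); gold_ids: (batch, seq).
    """
    eps = 1e-8
    p_a = F.softmax(teacher_logits_a / tau, dim=-1)
    p_b = F.softmax(teacher_logits_b / tau, dim=-1)

    # Inter-teacher agreement: symmetric KL squashed into (0, 1].
    sym_kl = 0.5 * ((p_a * ((p_a + eps).log() - (p_b + eps).log())).sum(-1)
                    + (p_b * ((p_b + eps).log() - (p_a + eps).log())).sum(-1))
    agreement = torch.exp(-sym_kl)                         # (batch, seq)

    # Entropy weighting: confidence of the averaged teacher, normalized to [0, 1].
    p_avg = 0.5 * (p_a + p_b)
    entropy = -(p_avg * (p_avg + eps).log()).sum(-1)
    confidence = 1.0 - entropy / math.log(p_avg.size(-1))
    gate = agreement * confidence                          # token-level routing weight

    # Distillation term: KL(avg teacher || student) at temperature tau.
    log_q = F.log_softmax(student_logits / tau, dim=-1)
    kd = (p_avg * ((p_avg + eps).log() - log_q)).sum(-1) * tau ** 2

    # Gold term: token-level cross-entropy against the reference summary.
    ce = F.cross_entropy(student_logits.transpose(1, 2), gold_ids, reduction="none")

    # High agreement and confidence -> trust the teachers; otherwise fall back to gold.
    return (gate * kd + (1.0 - gate) * ce).mean()
```

The key property mirrored from the paper's description is that each token independently chooses between teacher distillation and gold supervision, rather than applying one global mixing weight to the whole sequence.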
Why it matters
Beyond introducing two new methods (EWAD, CPDP), the paper maps out when multi-teacher supervision actually helps low-resource summarization: plain logit-level KD is the most dependable choice, cross-lingual pseudo-label transfer carries gains across languages, and data scaling can outweigh loss engineering. These findings give practitioners concrete guidance on which distillation recipe to apply in low-resource NLP.
Original Abstract
We study multi-teacher knowledge distillation for low-resource abstractive summarization from a reliability-aware perspective. We introduce EWAD (Entropy-Weighted Agreement-Aware Distillation), a token-level mechanism that routes supervision between teacher distillation and gold supervision based on inter-teacher agreement, and CPDP (Capacity-Proportional Divergence Preservation), a geometric constraint on the student position relative to heterogeneous teachers. Across two Bangla datasets, 13 BanglaT5 ablations, and eight Qwen2.5 experiments, we find that logit-level KD provides the most reliable gains, while more complex distillation improves semantic similarity for short summaries but degrades longer outputs. Cross-lingual pseudo-label KD across ten languages retains 71-122% of teacher ROUGE-L at 3.2x compression. A human-validated multi-judge LLM evaluation further reveals calibration bias in single-judge pipelines. Overall, our results show that reliability-aware distillation helps characterize when multi-teacher supervision improves summarization and when data scaling outweighs loss engineering.
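For context, the logit-level KD that the abstract finds most reliable is conventionally the temperature-scaled distillation objective of Hinton et al.: a KL term between teacher and student distributions mixed with the gold cross-entropy. A minimal single-teacher sketch follows, with `alpha` and `tau` as assumed hyperparameters rather than values from the paper.

```python
import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, gold_ids, alpha=0.5, tau=2.0):
    """Standard single-teacher logit-level KD (sketch; hyperparameters assumed).

    Mixes KL(teacher || student) at temperature tau with gold cross-entropy.
    Logits: (batch, seq, vocab); gold_ids: (batch, seq).
    """
    log_q = F.log_softmax(student_logits / tau, dim=-1)
    p = F.softmax(teacher_logits / tau, dim=-1)
    kd = F.kl_div(log_q, p, reduction="batchmean") * tau ** 2
    ce = F.cross_entropy(student_logits.transpose(1, 2), gold_ids)
    return alpha * kd + (1.0 - alpha) * ce
```

Its simplicity is presumably part of the point: the abstract's finding is that this plain objective holds up where more elaborate multi-teacher losses degrade longer outputs.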