Why Fine-Tuning Encourages Hallucinations and How to Fix It
Guy Kaplan, Zorik Gekhman, Zhen Zhu, Lotem Rozner, Yuval Reif + 3 more
TLDR
This paper shows that supervised fine-tuning (SFT) drives LLM hallucinations by degrading pre-trained knowledge, and proposes self-distillation and selective parameter freezing as mitigations.
Key contributions
- Supervised fine-tuning (SFT) increases LLM hallucinations by degrading pre-trained knowledge.
- Proposes a self-distillation SFT method that learns new facts while minimizing the loss of pre-existing knowledge (a minimal sketch follows this list).
- Shows that freezing parameter groups can reduce hallucinations when new knowledge acquisition is unnecessary (see the second sketch, after the abstract).
- Identifies localized interference among overlapping semantic representations as the main driver of SFT-induced hallucinations.
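The self-distillation idea can be illustrated with a short sketch. The snippet below is a minimal, hypothetical PyTorch rendering, not the paper's implementation: it assumes each model is a callable returning next-token logits of shape (batch, seq, vocab), and the KL direction and weight `lam` are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation): standard SFT
# cross-entropy on new facts, plus a KL term that regularizes drift
# of the output distribution away from a frozen pre-trained teacher.
import torch
import torch.nn.functional as F

def self_distillation_loss(student, teacher, input_ids, labels, lam=0.5):
    """CE on SFT targets + lam * KL(teacher || student).
    `lam` is a hypothetical regularization weight."""
    logits_s = student(input_ids)            # (batch, seq, vocab)
    with torch.no_grad():                    # teacher stays frozen
        logits_t = teacher(input_ids)

    vocab = logits_s.size(-1)

    # Factual learning: fit the new SFT targets.
    ce = F.cross_entropy(
        logits_s.view(-1, vocab), labels.view(-1), ignore_index=-100
    )

    # Knowledge preservation: penalize drift of the student's output
    # distribution away from the pre-trained model's predictions.
    kl = F.kl_div(
        F.log_softmax(logits_s, dim=-1).view(-1, vocab),
        F.softmax(logits_t, dim=-1).view(-1, vocab),
        reduction="batchmean",
    )
    return ce + lam * kl
```

In practice the teacher would be a frozen copy of the model taken before fine-tuning (e.g. via `copy.deepcopy` with gradients disabled), so the regularizer anchors the student to its own pre-trained output distribution, hence "self"-distillation.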
Why it matters
This paper addresses a critical problem in LLMs: factual hallucinations caused by fine-tuning. By offering practical mitigation strategies and uncovering the underlying mechanism, it provides valuable insights for developing more reliable and trustworthy AI models.
Original Abstract
Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations w.r.t. knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from the continual learning literature, since they arise as a by-product of knowledge degradation during training. We propose a self-distillation-based SFT method that facilitates effective factual learning while minimizing hallucinations w.r.t. pre-existing knowledge by regularizing output-distribution drift. We also show that, in settings where new knowledge acquisition is unnecessary, suppressing factual plasticity by freezing parameter groups can preserve task performance while reducing hallucinations. Lastly, we investigate the mechanism behind SFT-induced hallucinations through three hypotheses: capacity limitations, behavior cloning, and localized interference. Our experiments show that a main driver is interference among overlapping semantic representations, and that self-distillation succeeds by mitigating this interference.
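The freezing variant is likewise easy to sketch. In the hypothetical snippet below, which parameter group carries "factual plasticity" is an assumption: the feed-forward (MLP) blocks are a plausible stand-in, not necessarily the groups the paper identifies.

```python
# Minimal sketch, assuming factual knowledge lives mainly in the
# feed-forward (MLP) blocks -- an illustrative choice, not the
# paper's identified parameter group.
import torch
import torch.nn as nn

def freeze_parameter_group(model: nn.Module, substring: str = "mlp") -> None:
    """Disable gradients for every parameter whose name contains `substring`."""
    for name, param in model.named_parameters():
        if substring in name:
            param.requires_grad = False

# Build the optimizer only over the parameters left trainable, so the
# frozen group is untouched during SFT:
#   trainable = (p for p in model.parameters() if p.requires_grad)
#   optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```

Because the frozen group never updates, this suppresses new factual learning entirely, which fits the setting the abstract describes: task performance matters, but no new knowledge needs to be acquired.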