MoRFI: Monotonic Relationship Feature Identification
Dimitris Dimakopoulos, Shay B. Cohen, Ioannis Konstas
TLDR
MoRFI identifies sparse autoencoder (SAE) features that reveal how fine-tuning LLMs on new facts disrupts stored knowledge and increases hallucinations.
Key contributions
- Controlled fine-tuning experiments on Llama 3.1 8B, Gemma 2 9B, and Mistral 7B v0.3.
- Confirms that introducing new knowledge and prolonging training both exacerbate LLM hallucinations.
- Proposes MoRFI to identify SAE features causally linked to hallucination increase.
- Discovers specific latent directions in the residual stream that are disrupted by new facts, impairing knowledge retrieval.
Why it matters
This research illuminates the underlying mechanisms of LLM hallucinations caused by fine-tuning with new information. By pinpointing specific latent directions, it offers a crucial step towards understanding and mitigating this common problem, enhancing LLM reliability and factual consistency.
Original Abstract
Large language models (LLMs) acquire most of their factual knowledge during the pre-training stage, through next token prediction. Subsequent stages of post-training often introduce new facts outwith the parametric knowledge, giving rise to hallucinations. While it has been demonstrated that supervised fine-tuning (SFT) on new knowledge may exacerbate the problem, the underlying mechanisms are still poorly understood. We conduct a controlled fine-tuning experiment, focusing on closed-book QA, and find latent directions that causally contribute to hallucinations. Specifically, we fine-tune Llama 3.1 8B, Gemma 2 9B and Mistral 7B v0.3 on seven distinct single QA datasets, controlling for the percentage of new knowledge and number of training epochs. By measuring performance on the test set, we validate that incrementally introducing new knowledge increases hallucinations, with the effect being more pronounced with prolonged training. We leverage pre-trained sparse autoencoders (SAEs) to analyze residual stream activations across various checkpoints for each model and propose Monotonic Relationship Feature Identification (MoRFI) for capturing causally relevant latents. MoRFI filters SAE features that respond monotonically to controlled fine-tuning data mixtures of a target property. Our findings show that exposure to unknown facts disrupts the model's ability to retrieve stored knowledge along a set of directions in the residual stream. Our pipeline reliably discovers them across distinct models, recovering knowledge through single-latent interventions.
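The monotonic filtering step described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes mean SAE feature activations have already been computed per checkpoint, with checkpoints ordered by the fraction of new knowledge in the fine-tuning mixture, and it keeps only features whose mean activation moves strictly in one direction across that ordering. The function name `monotonic_feature_mask` and the `tol` parameter are hypothetical.

```python
import numpy as np

def monotonic_feature_mask(mean_acts: np.ndarray, tol: float = 0.0) -> np.ndarray:
    """Flag SAE features responding monotonically to the data mixtures.

    mean_acts: array of shape (n_mixtures, n_features), where row i holds the
        mean SAE feature activations for the checkpoint fine-tuned on the i-th
        mixture, and rows are ordered by the mixture's fraction of new knowledge.
    tol: minimum step size required between consecutive mixtures (hypothetical
        slack parameter; 0.0 demands strict monotonicity).
    Returns a boolean mask over features: True where the mean activation is
    strictly increasing or strictly decreasing across all mixtures.
    """
    diffs = np.diff(mean_acts, axis=0)            # consecutive activation changes
    increasing = np.all(diffs > tol, axis=0)      # rises at every step
    decreasing = np.all(diffs < -tol, axis=0)     # falls at every step
    return increasing | decreasing

# Toy example: 3 mixtures (0%, 50%, 100% new knowledge), 3 SAE features.
acts = np.array([[0.1, 0.5, 0.3],
                 [0.2, 0.4, 0.9],
                 [0.3, 0.3, 0.1]])
mask = monotonic_feature_mask(acts)  # features 0 (rising) and 1 (falling) pass
```

A rank-correlation criterion (e.g. Spearman's rho against the mixture fractions) would be a softer alternative to strict monotonicity; the strict version above is just the simplest reading of "responds monotonically."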