ArXiv TLDR

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

arXiv:2604.08519

Jiayuan Ye, Vitaly Feldman, Kunal Talwar

cs.CL, stat.ML

TLDR

Pruning training data to limit and balance the facts it contains improves LLM factual memorization, letting a smaller model match the factual recall of a much larger one.

Key contributions

  • Formalizes fact memorization from an information-theoretic perspective, showing that fact accuracy falls below the capacity limit whenever the information in the training-data facts exceeds model capacity, and that skewed (e.g. power-law) fact frequency distributions make this worse.
  • Proposes data selection schemes based only on training loss that limit the number of facts in the training data and flatten their frequency distribution (see the sketch after this list).
  • The selection method boosts fact accuracy up to the capacity limit on semi-synthetic datasets with high-entropy facts.
  • A GPT2-Small model (110M parameters) memorizes 1.3X more entity facts than with standard training, matching a 10X larger (1.3B-parameter) model pretrained on the full dataset.
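As a rough illustration of what "training loss-based data selection" could look like, the sketch below keeps the highest-loss examples and drops a share of the low-loss ones. The intuition (an assumption on my part, not the paper's exact scheme) is that low-loss examples tend to be already-memorized, high-frequency facts, so dropping some of them both caps the number of facts and flattens their frequency distribution. Function and variable names are illustrative.

```python
import numpy as np

def select_by_loss(losses, keep_fraction=0.5):
    """Toy loss-based selection: retain the highest-loss training examples.

    `losses` holds one training loss per example (e.g. from a pass of the
    current model). The keep rule and the 50% default are illustrative only.
    """
    n_keep = max(1, int(keep_fraction * len(losses)))
    order = np.argsort(losses)[::-1]   # highest loss first
    return np.sort(order[:n_keep])     # indices of retained examples

# Usage: simulate per-example losses and select half of the corpus.
example_losses = np.random.default_rng(0).gamma(shape=2.0, scale=1.0, size=10)
kept = select_by_loss(example_losses, keep_fraction=0.5)
print("kept example indices:", kept)
```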

Why it matters

This paper targets a core LLM limitation: factual memorization and the hallucinations that follow from it. By curating the training data rather than scaling the model, it lets smaller models retain significantly more facts and match the factual accuracy of much larger ones, offering a path to cheaper, more reliable LLMs for knowledge-intensive tasks.

Original Abstract

Large language models (LLMs) can struggle to memorize factual knowledge in their parameters, often leading to hallucinations and poor performance on knowledge-intensive tasks. In this paper, we formalize fact memorization from an information-theoretic perspective and study how training data distributions affect fact accuracy. We show that fact accuracy is suboptimal (below the capacity limit) whenever the amount of information contained in the training data facts exceeds model capacity. This is further exacerbated when the fact frequency distribution is skewed (e.g. a power law). We propose data selection schemes based on the training loss alone that aim to limit the number of facts in the training data and flatten their frequency distribution. On semi-synthetic datasets containing high-entropy facts, our selection method effectively boosts fact accuracy to the capacity limit. When pretraining language models from scratch on an annotated Wikipedia corpus, our selection method enables a GPT2-Small model (110m parameters) to memorize 1.3X more entity facts compared to standard training, matching the performance of a 10X larger model (1.3B parameters) pretrained on the full dataset.
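To make the capacity condition in the abstract concrete, here is a back-of-envelope check: compare the total information carried by the training facts against an assumed per-parameter storage budget. All numbers (bits per fact, bits per parameter, fact count) are hypothetical and chosen only to show when "information exceeds capacity" triggers; they are not figures from the paper.

```python
# Illustrative capacity check: a model with P parameters storing roughly
# c bits per parameter saturates once n_facts * bits_per_fact > c * P.
def exceeds_capacity(n_facts, bits_per_fact, n_params, bits_per_param):
    needed = n_facts * bits_per_fact
    available = n_params * bits_per_param
    return needed > available, needed, available

over, needed, available = exceeds_capacity(
    n_facts=50_000_000,      # assumed number of distinct facts
    bits_per_fact=24,        # assumed entropy per fact
    n_params=110_000_000,    # GPT2-Small scale
    bits_per_param=2,        # assumed storage budget per parameter
)
print(f"needed={needed:.2e} bits, available={available:.2e} bits, over capacity: {over}")
```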
