ArXiv TLDR

Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling

arXiv:2604.28075

Ansar Aynetdinov, Patrick Haller, Alan Akbik

cs.CL, cs.AI

TLDR

For German LLMs, repeating high-quality filtered data consistently outperforms single-pass training on larger, less filtered datasets.

Key contributions

  • Investigates the data-filtering trade-off for German LLMs: corpus diversity via a single pass vs. repetition of a high-quality core.
  • Finds that repeating high-quality filtered data consistently outperforms single-pass training on larger, more diverse sets.
  • Shows the performance gap persists even after 7 epochs, indicating a lasting benefit of quality over volume.
  • Releases state-of-the-art German LLMs (Boldt) trained on 10-360x fewer tokens than comparable models.

Why it matters

This paper challenges the common wisdom that non-English LLMs benefit most from maximizing data diversity. It demonstrates that strict quality filtering combined with multi-epoch repetition yields more efficient training and stronger models, offering a viable path to high-quality LLMs in high-resource non-English languages at significantly reduced computational cost.
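As a rough illustration of the fixed-budget framing (not the authors' actual setup), the minimal sketch below compares the two regimes at the same training-token budget: a single pass over a large, lightly filtered corpus versus several passes over a strictly filtered core. All dataset sizes are hypothetical; only the 7-epoch figure echoes the paper.

```python
# Hypothetical numbers illustrating the fixed-budget trade-off; not the paper's dataset sizes.

def epochs_for_budget(token_budget: int, dataset_tokens: int) -> float:
    """Number of passes over a dataset that a fixed training-token budget affords."""
    return token_budget / dataset_tokens

budget = 100_000_000_000  # fixed budget of 100B training tokens (made up)

# Regime A: one pass over a large, lightly filtered web corpus.
diverse_corpus_tokens = 100_000_000_000   # ~1 epoch
# Regime B: a strictly filtered high-quality core, repeated.
filtered_core_tokens = 14_000_000_000     # ~7 epochs at the same budget

print(f"Diverse corpus: {epochs_for_budget(budget, diverse_corpus_tokens):.1f} epochs")
print(f"Filtered core:  {epochs_for_budget(budget, filtered_core_tokens):.1f} epochs")
# The paper's finding: at equal budgets, Regime B consistently wins, even at ~7 epochs.
```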

Original Abstract

Recent research has shown that filtering massive English web corpora into high-quality subsets significantly improves training efficiency. However, for high-resource non-English languages like German, French, or Japanese, aggressive filtering creates a strategic dilemma: should practitioners prioritize diversity by training once on large amounts of lightly filtered web data, or prioritize quality by strictly filtering for a high-quality core and repeating it over multiple epochs? We investigate this trade-off for German by constructing hierarchical quality filters applied to 500M web documents, comparing multi-epoch training on the filtered subsets against single-pass training on a diverse corpus. Our experiments across multiple model scales and token budgets show that repeating high-quality data consistently outperforms single-pass training on larger, less filtered sets. Notably, the performance gap persists even after 7 epochs. Our findings suggest that for non-English LLMs, semantic concentration through quality filtering offers a more viable path to efficient language modeling than simply maximizing unique data volume. We release our German language models (called Boldt), as well as our cleaned evaluation benchmarks to the research community. Our experiments indicate that they achieve state-of-the-art results despite training on 10-360x fewer tokens than comparable models.
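To make the method concrete, here is a minimal sketch of what "hierarchical quality filters plus multi-epoch repetition" can look like in code. The scoring function, thresholds, and helper names (`tier_documents`, `repeat_epochs`) are hypothetical placeholders, not the authors' released pipeline.

```python
from typing import Callable, Iterable, List

def tier_documents(
    docs: Iterable[str],
    quality_score: Callable[[str], float],
    thresholds: List[float],  # ascending cut-offs, e.g. [0.3, 0.6, 0.9]
) -> List[List[str]]:
    """Split a corpus into nested quality tiers; higher thresholds yield stricter subsets."""
    tiers: List[List[str]] = [[] for _ in thresholds]
    for doc in docs:
        score = quality_score(doc)
        for i, threshold in enumerate(thresholds):
            if score >= threshold:
                tiers[i].append(doc)
    return tiers

def repeat_epochs(subset: List[str], epochs: int) -> Iterable[str]:
    """Stream a small high-quality subset for several passes instead of one pass over everything."""
    for _ in range(epochs):
        yield from subset

# Toy usage: document length as a crude stand-in for a real quality classifier.
corpus = ["ein kurzer text", "ein deutlich laengerer und informativerer absatz"]
tiers = tier_documents(corpus, lambda d: min(1.0, len(d) / 50), [0.3, 0.6, 0.9])
training_stream = repeat_epochs(tiers[-1], epochs=7)  # strictest tier, repeated
```

The point of the sketch is only the shape of the comparison: the strictest tier is much smaller, so the same token budget buys several passes over it.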
