Thomas Wolf
4 papers · Latest:
How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data
A systematic study of synthetic pretraining data for LLMs finds that structured prompt formats and source data matter more than generator model size, yielding the efficient FinePhrase dataset.
StarCoder 2 and The Stack v2: The Next Generation
StarCoder2 is a next-generation open-source Code LLM trained on a vastly expanded and diverse dataset, achieving state-of-the-art performance on multiple code benchmarks while being more parameter-efficient than larger models.
GAIA: a benchmark for General AI Assistants
GAIA is a new benchmark designed to evaluate AI assistants on real-world tasks requiring reasoning, multi-modality, web browsing, and tool use, highlighting a significant gap between AI and human performance.
StarCoder: may the source be with you!
StarCoder is a 15.5B parameter open-source code generation model trained on a trillion tokens that outperforms existing open Code LLMs across multiple languages and offers advanced safety and usability features.