Thomas Wolf
4 papers · Latest:
How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data
A systematic study of synthetic pretraining data for LLMs finds that structured prompt formats and source data matter more than generator model size, yielding the efficient FinePhrase dataset.
StarCoder 2 and The Stack v2: The Next Generation
StarCoder2 is a next-generation open-source Code LLM trained on a vastly expanded and diverse dataset, achieving state-of-the-art performance on multiple code benchmarks while being more parameter-efficient than larger models.
GAIA: a benchmark for General AI Assistants
GAIA is a new benchmark designed to evaluate AI assistants on real-world tasks requiring reasoning, multi-modality, web browsing, and tool use, highlighting a significant gap between AI and human performance.
StarCoder: may the source be with you!
StarCoder is a 15.5B parameter open-source code generation model trained on a trillion tokens that outperforms existing open Code LLMs across multiple languages and offers advanced safety and usability features.