Colin Raffel
3 papers · Latest:
How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data
A systematic study of synthetic pretraining data for LLMs finds that structured prompt formats and the choice of source data matter most, while large generator models do not, leading to the efficient FinePhrase dataset.
Crosslingual Generalization through Multitask Finetuning
This paper demonstrates that multitask finetuning of large multilingual language models on English and machine-translated prompts enables strong zero-shot crosslingual generalization to many languages, including those unseen during training.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
This paper introduces a unified text-to-text framework for transfer learning in NLP, achieving state-of-the-art results across diverse language tasks by systematically exploring pre-training and fine-tuning strategies.