RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi + 5 more
TLDR
RoBERTa revisits BERT pretraining with optimized hyperparameters and more data, achieving state-of-the-art NLP performance and revealing that BERT was originally undertrained.
Key contributions
- Conducted a thorough replication study of BERT pretraining to isolate effects of hyperparameters and data size.
- Demonstrated that straightforward design changes substantially improve performance over the original BERT: training longer with larger batches over more data, removing the next-sentence-prediction objective, training on longer sequences, and dynamically changing the masking pattern applied to the training data (see the sketch after this list).
- Achieved state-of-the-art results on GLUE, RACE, and SQuAD, matching or exceeding the performance of every model published after BERT.
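To make the design changes above concrete, here is a minimal sketch of RoBERTa-style masked-language-model pretraining with dynamic masking. It uses the Hugging Face `transformers` library rather than the authors' original fairseq code, and the example sentences, model size, and optimizer settings are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of RoBERTa-style MLM pretraining with dynamic masking.
# Uses Hugging Face `transformers`, not the authors' fairseq implementation;
# texts and hyperparameters below are illustrative, not the paper's setup.
import torch
from transformers import (
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

# A small, randomly initialized model stands in for full-scale pretraining.
config = RobertaConfig(
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    intermediate_size=1024,
)
model = RobertaForMaskedLM(config)

# Dynamic masking: the collator samples a fresh mask pattern every time a
# batch is built, instead of fixing the masks once during preprocessing as
# in the original BERT data pipeline.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

texts = [
    "Language model pretraining has led to significant performance gains.",
    "Careful comparison between different approaches is challenging.",
]
examples = [
    {"input_ids": ids}
    for ids in tokenizer(texts, truncation=True, max_length=128)["input_ids"]
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

model.train()
for step in range(2):
    batch = collator(examples)   # new random masks on every call
    outputs = model(**batch)     # MLM labels are set by the collator
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"step {step}: mlm loss = {outputs.loss.item():.3f}")
```

At full scale the paper trains with much larger batches (up to 8K sequences), full-length 512-token sequences, and over 160GB of text; this toy loop only illustrates the dynamic-masking mechanic, not that scale.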
Why it matters
This paper challenges the notion that newer models inherently outperform BERT, showing that careful optimization and more extensive training can yield superior results. It underscores the critical role of training design choices and gives the community a stronger, openly available baseline to build on.
Original Abstract
Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.