BERT (Devlin et al.) is a pioneering language model that is pretrained with a denoising autoencoding objective and produces state-of-the-art results on many NLP tasks. However, there is still room for improvement in the original BERT model with respect to its pretraining objectives, the data on which it is trained, the duration for which it is trained, and so on. Facebook AI Research (FAIR) identified these issues and proposed an ‘optimized’ and ‘robust’ version of BERT.
In this article we’ll discuss RoBERTa: A Robustly Optimized BERT Pretraining Approach, proposed in Liu et al., which is an extension of the original BERT model. The prerequisite for this article is a general awareness of BERT’s architecture and its pretraining and fine-tuning objectives, which in turn assumes familiarity with the Transformer model (Vaswani et al.).
I have already covered Transformers in this article and BERT in this article. Consider giving them a read if you’re interested.
If I were to summarize the RoBERTa paper in one line:
It essentially involves tuning the original BERT pretraining procedure, along with manipulating the data and the inputs.
Yep, that’s it! The authors of RoBERTa suggest that BERT is significantly undertrained, and they put forth some improvements to address this. In the upcoming sections, we’ll discuss the whats and hows of this tuning.
Data
It has been observed that training BERT on larger datasets greatly improves its performance. So RoBERTa is trained on a vast dataset of over 160GB of uncompressed text, composed of the following corpora:
BookCorpus + English Wikipedia (16GB): This is the data on which BERT is trained.
CC-News (76GB): The authors collected this data from the English portion of the CommonCrawl News dataset. It contains 63M English news articles crawled between September 2016 and February 2019.
OpenWebText (38GB): An open-source recreation of the WebText dataset used to train OpenAI’s GPT-2.
Stories (31GB): A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.
Static vs. Dynamic Masking
The masked language modeling objective in BERT pretraining essentially consists of masking a few tokens from each sequence at random and then predicting these tokens. However, in the original implementation of BERT, the sequences are masked just once, during preprocessing. This implies that the same masking pattern is used for a given sequence in every training epoch.
To avoid this, in their re-implementation of BERT, the authors duplicated the training data 10 times so that each sequence was masked in 10 different ways. The model was trained for 40 epochs, so each sequence was seen with the same masking pattern 4 times (40 epochs / 10 patterns).
In addition to this, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model.
The results reported in the paper show that the re-implementation with static masking performs almost the same as the original BERT masking approach, while dynamic masking is comparable to or slightly better than the static approaches. Hence, RoBERTa adopts dynamic masking for pretraining.
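The difference between the two masking strategies can be sketched in a few lines of Python. This is only an illustrative toy, not the authors’ implementation: the helper names (`mask_tokens`, `static_masking`, `dynamic_masking`) are hypothetical, the 15% masking rate is taken from the original BERT recipe, and BERT’s additional rule of sometimes keeping or randomly replacing a selected token is left out for brevity.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Return a copy of tokens with a random subset replaced by [MASK].

    Simplification: every selected token becomes [MASK]; the 15% rate is
    an assumption borrowed from the original BERT recipe.
    """
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]

def static_masking(tokens, num_copies=10, seed=0):
    """Precompute a fixed pool of masked copies once, during preprocessing."""
    rng = random.Random(seed)
    return [mask_tokens(tokens, rng) for _ in range(num_copies)]

def dynamic_masking(tokens, num_epochs, seed=0):
    """Draw a fresh mask every time the sequence is fed to the model."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        yield mask_tokens(tokens, rng)

tokens = "the quick brown fox jumps over the lazy dog near the river bank".split()
num_epochs = 40

static_pool = static_masking(tokens, num_copies=10)
# Static: epoch e reuses copy e % 10, so each pattern is seen 40 / 10 = 4 times.
static_seen = [tuple(static_pool[epoch % len(static_pool)]) for epoch in range(num_epochs)]
# Dynamic: a new pattern is sampled in every epoch.
dynamic_seen = [tuple(masked) for masked in dynamic_masking(tokens, num_epochs)]

print("distinct static patterns :", len(set(static_seen)))
print("distinct dynamic patterns:", len(set(dynamic_seen)))
```

With static masking the model can never see more than 10 distinct patterns for a sequence, however long training runs, whereas dynamic masking keeps producing new ones. It also removes the need to store duplicated copies of the dataset.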
Input Representations and Next Sentence Prediction
The original BERT paper suggests that the Next Sentence Prediction (NSP) task is essential for obtaining the best results from the model. Recent studies have questioned the necessity of this objective in pretraining.
So, now we’ll see the different types of input representations that could be used with BERT and how they’d help with eliminating the NSP objective in pretraining:
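Segment-Pair + NSP: This is the input representation used in the original BERT implementation. Each input is a pair of segments (segments, not sentences; each segment can contain multiple natural sentences). With a probability of 0.5 the two segments are contiguous in the same document; otherwise they are sampled from different documents, and the model is trained with the NSP objective to predict which is the case. NSP was designed to help downstream tasks such as Natural Language Inference (NLI), which require reasoning about the relationship between a pair of sentences. The total combined length must be less than 512 tokens (the maximum sequence length for the BERT model).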
Sentence-Pair + NSP: Same as the segment-pair representation, but with pairs of single sentences. Since the total length of these inputs is usually far below 512 tokens, a larger batch size is used so that the number of tokens processed per training step is similar to that of the segment-pair representation.
Full-Sentences: Input sequences are packed with full sentences sampled contiguously from one or more documents, such that the total length is at most 512 tokens. When one document ends, sentences from the next document are added, with an extra separator token between the documents.
Doc-Sentences: This is the same as Full-Sentences, except that a sequence doesn’t cross document boundaries, i.e. once a document is over, sentences from the next one aren’t added to the sequence. Since inputs sampled near the end of a document may be shorter than 512 tokens, the batch size is increased dynamically in those cases to match the number of tokens processed in the Full-Sentences setting.
Note that the Full-Sentences and Doc-Sentences representations are NOT trained with the NSP objective.
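To make the difference between Full-Sentences and Doc-Sentences concrete, here is a minimal packing sketch. It is an illustration under stated assumptions rather than the paper’s data pipeline: `pack_sequences` is a hypothetical helper, sentences are assumed to be pre-tokenized lists, overlong sentences are simply truncated, and the `</s>` string is only a placeholder for the extra separator token. A tiny `max_len` is used instead of 512 so the output is easy to read.

```python
def pack_sequences(documents, max_len=512, cross_documents=True, sep="</s>"):
    """Greedily pack sentences into sequences of at most max_len tokens.

    documents: list of documents; each document is a list of sentences;
               each sentence is a list of tokens.
    cross_documents=True  -> Full-Sentences style: a sequence may continue
                             into the next document after a separator.
    cross_documents=False -> Doc-Sentences style: a sequence never crosses
                             a document boundary.
    """
    sequences, current = [], []
    for doc in documents:
        # Mark the document boundary inside a sequence that is still open.
        if cross_documents and current and len(current) < max_len:
            current.append(sep)
        for sentence in doc:
            sentence = sentence[:max_len]  # assumption: truncate overlong sentences
            if current and len(current) + len(sentence) > max_len:
                sequences.append(current)
                current = []
            current = current + sentence
        # Doc-Sentences: close the sequence when the document ends.
        if not cross_documents and current:
            sequences.append(current)
            current = []
    if current:
        sequences.append(current)
    return sequences

docs = [
    [["a"] * 4, ["b"] * 3],  # document 1: two sentences
    [["c"] * 2],             # document 2: one short sentence
    [["d"] * 5, ["e"] * 4],  # document 3: two sentences
]

full = pack_sequences(docs, max_len=8, cross_documents=True)
doc_only = pack_sequences(docs, max_len=8, cross_documents=False)

print("Full-Sentences lengths:", [len(seq) for seq in full])
print("Doc-Sentences lengths :", [len(seq) for seq in doc_only])
```

In this toy run the Doc-Sentences sequences come out shorter and more numerous, which is why the batch size has to grow in that setting to keep the number of tokens per step comparable.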

On comparing the results, first, the segment-pair representation originally used in Devlin et al. performs better on downstream tasks than the individual sentence (sentence-pair) representation. Second, the Doc-Sentences setting outperforms the original BERT (BASE) model, and removing the NSP objective matches or slightly improves downstream task performance.
Large Batch Sizes
Past work has shown that the Transformer and BERT models are amenable to large batch sizes. Training with large batches makes optimization faster and can improve end-task performance, provided the learning rate is tuned appropriately.

Note that, as the batch size increases, the number of training steps is reduced so that a given sequence is ultimately seen roughly the same number of times. For example, training with a batch size of 256 for 1M steps is equivalent in computational cost to training with a batch size of 2K for 125K steps or with a batch size of 8K for 31K steps.
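The equivalence is simple arithmetic: the total number of sequences processed is the batch size times the number of steps. The short check below reads "2K" and "8K" literally as 2,000 and 8,000, which is an assumption about rounded figures; `sequences_processed` is a hypothetical helper name.

```python
def sequences_processed(batch_size, steps):
    """Total number of training sequences seen: batch size times step count."""
    return batch_size * steps

# (batch size, training steps) for the three settings mentioned above.
configs = [(256, 1_000_000), (2_000, 125_000), (8_000, 31_000)]

baseline = sequences_processed(*configs[0])
for batch_size, steps in configs:
    total = sequences_processed(batch_size, steps)
    ratio = total / baseline
    print(f"batch={batch_size:>5}, steps={steps:>9,}: {total:,} sequences ({ratio:.1%} of baseline)")
```

All three settings land within a few percent of each other, so the comparison isolates the effect of the batch size rather than the amount of training.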
Tokenization
For tokenization, RoBERTa uses byte-level Byte-Pair Encoding (BPE) with a vocabulary of 50K subword units, in contrast to BERT’s character-level BPE with a 30K vocabulary.
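One practical consequence of working at the byte level is that the base alphabet has only 256 symbols, so any input text can be encoded without an "unknown" token. The snippet below illustrates just that starting point, using plain Python string encoding. It does not learn or apply any BPE merges, and the helper names are hypothetical.

```python
def to_byte_symbols(text):
    """Split text into its UTF-8 bytes, the base symbols of a byte-level BPE."""
    return list(text.encode("utf-8"))

def from_byte_symbols(symbols):
    """Invert to_byte_symbols: rebuild the original string from its bytes."""
    return bytes(symbols).decode("utf-8")

samples = ["RoBERTa", "日本語"]
for text in samples:
    symbols = to_byte_symbols(text)
    # Every symbol falls in the fixed 256-entry base alphabet.
    assert all(0 <= s < 256 for s in symbols)
    # The mapping is lossless, whatever script the text is written in.
    assert from_byte_symbols(symbols) == text
    print(f"{text!r}: {len(text)} characters -> {len(symbols)} byte symbols")
```

A byte-level BPE then merges frequent byte sequences into larger subword units on top of this alphabet, which is how the vocabulary grows to the 50K units mentioned above.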
Results

According to the results reported in the paper, RoBERTa outperforms the previous state of the art on almost all of the GLUE tasks, including when compared against ensemble models.
Putting it All Together
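RoBERTa is BERT, but:
Trained on more data: over 160GB of text instead of 16GB.
Pretrained with dynamic masking instead of static masking.
Pretrained without the NSP objective.
Trained with much larger batch sizes.
Tokenized with a byte-level BPE vocabulary of 50K subword units.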
Conclusion
We have discussed another state-of-the-art language model and compared it with the original BERT baseline.
Here is a link to the GitHub repository for the open-sourced code of the RoBERTa model.
For the model architecture API and pretrained weights, refer to the Hugging Face docs.
