BERT (Devlin et al.) is a pioneering language model pretrained with a denoising autoencoding objective to produce state-of-the-art results on many NLP tasks. However, there is still room for improvement in the original BERT model with respect to its pretraining objectives, the data on which it is trained, the duration for which it is trained, and so on. These issues were identified by Facebook AI Research (FAIR), and hence they proposed an ‘optimized’ and ‘robust’ version of BERT.
In this article we’ll be discussing RoBERTa: A Robustly Optimized BERT Pretraining Approach, proposed in Liu et al. as an extension of the original BERT model. The prerequisite for this article is a general awareness of BERT’s architecture and its pretraining and fine-tuning objectives, which by default includes sufficient awareness of the Transformer model (Vaswani et al.).
I have already covered Transformers in this article, and BERT in this one. Consider giving them a read if you’re interested.
RoBERTa
If I were to summarize the RoBERTa paper in one line:
It essentially amounts to tuning the original BERT pretraining recipe, along with manipulating the data and inputs.
Yep, that’s it! The authors of RoBERTa suggest that BERT is significantly undertrained and hence put forth some improvements to its pretraining. In the upcoming sections, we’ll discuss the whats and hows of this tuning.
Data
It has been observed that training BERT on larger datasets greatly improves its performance. So RoBERTa is trained on a vast dataset comprising over 160GB of uncompressed text, composed of the following corpora:
BookCorpus + English Wikipedia (16GB): This is the data on which BERT is trained.
CC-News (76GB): The authors have collected this data from the English portion of the CommonCrawl News Data. It contains 63M English news articles crawled between September 2016 and February 2019.
OpenWebText (38GB): An open-source recreation of the WebText dataset used to train OpenAI GPT-2.
Stories (31GB): A subset of CommonCrawl data filtered to match the story-like style of Winograd schemas.
Static vs. Dynamic Masking
The masked language modeling objective in BERT pretraining is essentially masking a few tokens from each sequence at random and then predicting these tokens. However, in the original implementation of BERT, the sequences are masked just once in the preprocessing. This implies that the same masking pattern is used for the same sequence in all the training steps.
To avoid this, in their re-implementation of BERT, the authors duplicated the training data 10 times so that each sequence was masked with 10 different patterns. Training then ran for 40 epochs, i.e. each sequence was seen with the same masking pattern 4 times.
In addition to this, dynamic masking was tried, wherein a masking pattern is generated every time a sequence is fed to the model.
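To make the difference concrete, here is a minimal sketch of dynamic masking in plain PyTorch. This is not the authors' code; the 15% masking rate and the 80/10/10 replacement split follow the standard BERT recipe, and `mask_token_id`, `vocab_size`, and `special_token_mask` are assumed to be provided by the caller.

```python
import torch

def dynamic_mask(input_ids, mask_token_id, vocab_size, special_token_mask, mlm_prob=0.15):
    """Generate a fresh masking pattern for a batch of token ids.

    Because this runs every time a batch is fed to the model, the same
    sequence receives a different pattern on each pass (dynamic masking).
    Static masking would run this once, during preprocessing.
    """
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Choose ~15% of positions to predict, never special tokens ([CLS], [SEP], padding).
    prob = torch.full(input_ids.shape, mlm_prob)
    prob.masked_fill_(special_token_mask, 0.0)
    selected = torch.bernoulli(prob).bool()
    labels[~selected] = -100  # positions not selected are ignored by the loss

    # 80% of the selected positions are replaced with the [MASK] token.
    masked = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[masked] = mask_token_id

    # 10% are replaced with a random token; the remaining 10% stay unchanged.
    randomized = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~masked
    input_ids[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]

    return input_ids, labels
```

With static masking, the corpus would instead be duplicated (10 times in the re-implementation above) and a function like this applied once per copy during preprocessing.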

The reported results show that the re-implementation with static masking yields almost the same results as the original BERT masking approach, while dynamic masking gives comparable or slightly better results than the static approaches. Hence, RoBERTa adopts dynamic masking for pretraining.
Input Representations and Next Sentence Prediction
The original BERT paper suggests that the Next Sentence Prediction (NSP) task is essential for obtaining the best results from the model. Recent studies have questioned the necessity of this objective in pretraining.
So, now we’ll see the different types of input representations that could be used with BERT and how they’d help with eliminating the NSP objective in pretraining:
Segment-Pair + NSP: This is the input representation used in the original BERT implementation. Each input is a pair of segments (segments, not sentences); with probability 0.5 the second segment is the actual continuation from the same document, and otherwise it is drawn from a different document, and the model is trained to predict whether the two segments come from the same document (the NSP objective). The total combined length must be less than 512 tokens (the maximum sequence length for the BERT model).
Sentence-Pair + NSP: Same as the segment-pair representation, just with pairs of natural sentences. Since these inputs are necessarily much shorter than 512 tokens, a larger batch size is used so that the number of tokens processed per training step is similar to that of the segment-pair representation.
Full-Sentences: Input sequences consist of full sentences from one or more documents. If one document ends, sentences from the next document are added, separated by an extra separator token, until the sequence length reaches at most 512 tokens (a packing sketch is given after this list).
Doc-Sentences: The same as Full-Sentences, except that sequences don’t cross document boundaries, i.e. once a document ends, sentences from the next one aren’t added to the sequence. Since document lengths vary, the batch size is varied as well so that the number of tokens per step roughly matches that of Full-Sentences.
Note that the Full-Sentences and Doc-Sentences representations are NOT trained with the NSP objective.
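To illustrate the packing described above, here is a rough sketch (not the authors' implementation). `documents` is assumed to be an iterable of documents, each a list of sentence strings, and `tokenize`, `sep_id`, and `max_len` are assumed inputs; the `flush_at_doc_end` flag shows how Doc-Sentences would differ from Full-Sentences.

```python
def pack_sequences(documents, tokenize, sep_id, max_len=512, flush_at_doc_end=False):
    """Pack tokenized sentences into sequences of at most `max_len` tokens.

    Full-Sentences (flush_at_doc_end=False): sequences may cross document
    boundaries, with an extra separator token inserted between documents.
    Doc-Sentences (flush_at_doc_end=True): the sequence is cut at the end
    of each document instead.
    """
    sequences, current = [], []
    for doc in documents:
        for sentence in doc:
            tokens = tokenize(sentence)[:max_len]  # guard against overly long sentences
            if len(current) + len(tokens) > max_len:
                sequences.append(current)
                current = []
            current.extend(tokens)
        if flush_at_doc_end:
            if current:
                sequences.append(current)
                current = []
        elif current and len(current) < max_len:
            current.append(sep_id)  # mark the document boundary, keep packing
    if current:
        sequences.append(current)
    return sequences
```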

Comparing the results: first, the segment-pair representation originally used in Devlin et al. performs better on downstream tasks than the individual-sentence (sentence-pair) representation. Moreover, the Doc-Sentences setting outperforms the original BERT (BASE) model, and removing the NSP objective matches or slightly improves downstream task performance.
Large Batch Sizes
Past work has shown that the Transformer and BERT models are amenable to large batch sizes. Using large batch sizes makes optimization faster and, when tuned correctly, can improve end-task performance for these models.

Note that, as the batch size increases, the number of training steps is adjusted so that a given sequence is ultimately optimized the same number of times. For example, training with a batch size of 256 for 1M steps is roughly equivalent to training with a batch size of 2K for 125K steps, or with a batch size of 8K for 31K steps.
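The equivalence comes down to keeping the total number of sequence passes (batch size × steps) roughly constant, which a quick check confirms:

```python
# batch size x training steps stays roughly constant across the three settings
for batch_size, steps in [(256, 1_000_000), (2_000, 125_000), (8_000, 31_000)]:
    print(f"{batch_size} x {steps} = {batch_size * steps:,} sequence passes")
# -> 256,000,000 / 250,000,000 / 248,000,000
```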
Tokenization
For tokenization, RoBERTa uses a byte-level Byte-Pair Encoding (BPE) scheme with a vocabulary of 50K subword units, in contrast to BERT’s character-level BPE with a 30K vocabulary.
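If you want to compare the two vocabularies yourself, a quick check with the Hugging Face transformers library looks like this (assuming the library and the pretrained checkpoints are available; the exact counts include added special tokens):

```python
from transformers import BertTokenizer, RobertaTokenizer

bert_tok = BertTokenizer.from_pretrained("bert-base-uncased")    # ~30K character-level BPE (WordPiece) vocabulary
roberta_tok = RobertaTokenizer.from_pretrained("roberta-base")   # ~50K byte-level BPE vocabulary

print(bert_tok.vocab_size)     # 30522
print(roberta_tok.vocab_size)  # 50265

# Byte-level BPE can represent any input string without an <unk> token,
# since it falls back to raw bytes for unseen characters.
print(roberta_tok.tokenize("RoBERTa 🤖"))
```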
Results

It is clear that RoBERTa outperforms the previous state of the art on almost all of the GLUE tasks, including against ensemble models.
Putting it All Together
RoBERTa is BERT but:
- trained on larger datasets
- trained much longer
- trained on large batches
- without the NSP objective in pretraining
- trained on longer sequences
- with dynamic mask generation
Conclusion
We have discussed another state-of-the-art language model and compared it with the BERT baseline.
Here is a link to the GitHub repository for the open-sourced code of the RoBERTa model.
For the model architecture API and pretrained weights, refer to the Hugging Face docs.
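As a minimal usage sketch (assuming a recent version of transformers and PyTorch is installed):

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")

# Encode a sentence and extract contextual representations from the pretrained encoder.
inputs = tokenizer("RoBERTa is BERT, trained longer on more data.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768])
```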
Translated from: https://medium.com/dataseries/roberta-robustly-optimized-bert-pretraining-approach-d033464bd946