Generative AI with Large Language Models, Study Notes - 2.1.4 LLM Instruction Fine-Tuning: Multi-Task Instruction Fine-Tuning
Multi-task instruction fine-tuning
Multitask fine-tuning is an extension of single-task fine-tuning, where the training dataset comprises example inputs and outputs for multiple tasks. Here, the dataset contains examples that instruct the model to carry out a variety of tasks, including summarization, review rating, code translation, and entity recognition. You train the model on this mixed dataset to improve its performance on all of the tasks simultaneously, thus avoiding the issue of catastrophic forgetting. Over many epochs of training, the losses calculated across these examples are used to update the weights of the model, resulting in an instruction-tuned model that has learned how to be good at many different tasks at the same time.
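To make the idea concrete, here is a minimal sketch, not from the course, of what a mixed multi-task instruction dataset might look like. The task names, prompts, and completions below are invented purely for illustration.

```python
import random

# A hypothetical mixed multi-task instruction dataset (illustration only).
# Each example pairs an instruction-style prompt with the expected completion;
# examples from all tasks are shuffled together so that every training batch
# updates the model on several tasks at once.
multi_task_examples = [
    {"task": "summarization",
     "prompt": "Summarize the following conversation.\n\nA: Are we still on for lunch?\nB: Yes, 12:30 works for me.",
     "completion": "A and B confirm they will meet for lunch at 12:30."},
    {"task": "review_rating",
     "prompt": "Rate the sentiment of this review as positive or negative.\n\nThe battery died after two days.",
     "completion": "negative"},
    {"task": "entity_recognition",
     "prompt": "List the person names mentioned in the text.\n\nMike met Tommy at the hotel.",
     "completion": "Mike, Tommy"},
]

random.shuffle(multi_task_examples)  # interleave the tasks before batching
for example in multi_task_examples:
    print(example["task"], "->", example["completion"])
```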
One drawback of multitask fine-tuning is that it requires a lot of data. You may need as many as 50,000 to 100,000 examples in your training set. However, assembling this data can be well worth the effort. The resulting models are often very capable and suitable for use in situations where good performance on many tasks is desirable.
Let's take a look at one family of models that has been trained using multitask instruction fine-tuning. Instruct model variants differ based on the datasets and tasks used during fine-tuning. One example is the FLAN family of models. FLAN, which stands for fine-tuned language net, is a specific set of instructions used to fine-tune different models. Because the FLAN fine-tuning is the last step of the training process, the authors of the original paper called it the metaphorical dessert to the main course of pre-training, quite a fitting name. FLAN-T5 is the FLAN instruct version of the T5 foundation model, while FLAN-PALM is the FLAN instruct version of the PaLM foundation model. You get the idea. FLAN-T5 is a great general-purpose instruct model. In total, it has been fine-tuned on 473 datasets across 146 task categories. Those datasets are drawn from other models and papers, as shown here. Don't worry about reading all the details right now. If you're interested, you can access the original paper through a reading exercise after the video and take a closer look.
One example of a prompt dataset used for summarization tasks in FLAN-T5 is SAMSum. It's part of the muffin collection of tasks and datasets and is used to train language models to summarize dialogue.
SAMSum is a dataset of 16,000 messenger-like conversations with summaries. Three examples are shown here, with the dialogues on the left and the summaries on the right. The dialogues and summaries were crafted by linguists for the express purpose of generating a high-quality training dataset for language models. The linguists were asked to create conversations similar to those they would write on a daily basis, reflecting the proportion of topics in their real-life messenger conversations. Language experts then created short summaries of those conversations that included important pieces of information and the names of the people in the dialogue.
Here is a prompt template designed to work with this SAMSum dialogue summary dataset. The template is actually comprised of several different instructions that all basically ask the model to do this same thing. Summarize a dialogue. For example, briefly summarize that dialogue. What is a summary of this dialogue? What was going on in that conversation? Including different ways of saying the same instruction helps the model generalize and perform better. Just like the prompt templates you saw earlier. You see that in each case, the dialogue from the SAMSum dataset is inserted into the template wherever the dialogue field appears. The summary is used as the label. After applying this template to each row in the SAMSum dataset, you can use it to fine tune a dialogue summarization task.
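As a rough sketch of how this template-application step might look in code, the snippet below uses the Hugging Face datasets library and the publicly available samsum dataset. The dataset identifier, field names, and template wordings are assumptions for illustration, not the exact setup used to build FLAN.

```python
import random
from datasets import load_dataset  # pip install datasets

# Several phrasings of the same instruction; varying the wording helps the
# model generalize across different ways of asking for a summary.
templates = [
    "Briefly summarize this dialogue:\n\n{dialogue}\n\nSummary:",
    "What is a summary of this dialogue?\n\n{dialogue}\n\nSummary:",
    "What was going on in this conversation?\n\n{dialogue}\n\nSummary:",
]

# "samsum" is a public dataset on the Hugging Face Hub; the exact identifier
# (and any extra download dependencies, such as py7zr) are assumptions here.
samsum = load_dataset("samsum", split="train")

def to_training_example(row):
    # Pick one instruction phrasing and insert the dialogue into it.
    template = random.choice(templates)
    return {
        "input_text": template.format(dialogue=row["dialogue"]),
        "label_text": row["summary"],  # the human-written summary is the label
    }

training_examples = samsum.map(to_training_example)
print(training_examples[0]["input_text"][:200])
```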
While FLAN-T5 is a great general-use model that shows good capability on many tasks, you may still find that it has room for improvement on tasks for your specific use case. For example, imagine you're a data scientist building an app to support your customer service team in processing requests received through a chatbot, like the one shown here.
Your customer service team needs a summary of every dialogue to identify the key actions the customer is requesting and to determine what actions should be taken in response. The SAMSum dataset gives FLAN-T5 some ability to summarize conversations. However, the examples in the dataset are mostly conversations between friends about day-to-day activities and don't overlap much with the language structure observed in customer service chats. You can perform additional fine-tuning of the FLAN-T5 model using a dialogue dataset that is much closer to the conversations that happen with your bot. This is the exact scenario that you'll explore in the lab this week. You'll make use of an additional domain-specific summarization dataset called dialogsum to improve FLAN-T5's ability to summarize support chat conversations. This dataset consists of over 13,000 support chat dialogues and summaries. The dialogsum dataset is not part of the FLAN-T5 training data, so the model has not seen these conversations before.
Let's take a look at an example from dialogsum and discuss how a further round of fine-tuning can improve the model. This is a support chat that is typical of the examples in the dialogsum dataset. The conversation is between a customer and a staff member at a hotel check-in desk. The chat has had a template applied so that the instruction to summarize the conversation is included at the start of the text. Now, let's take a look at how FLAN-T5 responds to this prompt before doing any additional fine-tuning. Note that the prompt is now condensed on the left to give you more room to examine the model's completion. Here is the model's response to the instruction. You can see that the model is able to identify that the conversation was about a reservation for Tommy. However, it does not do as well as the human-generated baseline summary, which includes important information such as Mike asking for information to facilitate check-in, and the model's completion has also invented information that was not included in the original conversation, specifically the name of the hotel and the city it was located in.
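If you want to reproduce this kind of baseline completion yourself, a minimal sketch using the transformers library and the public google/flan-t5-base checkpoint could look like the following. The checkpoint choice and the short stand-in dialogue are assumptions for illustration, not the actual dialogsum example shown in the video.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer  # pip install transformers

# "google/flan-t5-base" is a public checkpoint on the Hugging Face Hub;
# using it here is an assumption, the course text does not name a checkpoint size.
model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# A made-up stand-in for the hotel check-in dialogue discussed above.
prompt = (
    "Summarize the following conversation.\n\n"
    "#Person1#: Good evening, I'd like to check in. The reservation is under Tommy.\n"
    "#Person2#: Welcome! Could I see your ID and a credit card, please?\n"
    "#Person1#: Sure, here you go.\n\n"
    "Summary:"
)

# Generate the model's zero-shot completion before any additional fine-tuning.
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```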
Now let's take a look at how the model does after fine-tuning on the dialogsum dataset. Hopefully, you will agree that this is closer to the human-produced summary. There is no fabricated information, and the summary includes all of the important details, including the names of both people participating in the conversation. This example used the public dialogsum dataset to demonstrate fine-tuning on custom data.
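Below is a compressed sketch of the kind of additional fine-tuning run described here, assuming the transformers and datasets libraries and the public knkarthick/dialogsum copy of the dataset. The dataset identifier, prompt template, and hyperparameters are assumptions; the lab's actual code will differ.

```python
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "google/flan-t5-base"       # assumed checkpoint, as in the sketch above
dataset_id = "knkarthick/dialogsum"      # assumed public copy of dialogsum on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
dataset = load_dataset(dataset_id)       # splits: train / validation / test

def preprocess(batch):
    # Wrap each dialogue in a summarization instruction and tokenize prompt and label.
    prompts = ["Summarize the following conversation.\n\n" + d + "\n\nSummary:"
               for d in batch["dialogue"]]
    model_inputs = tokenizer(prompts, max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="flan-t5-dialogsum",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=1,
                                  learning_rate=5e-5),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```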
In practice, you'll get the most out of fine-tuning by using your company's own internal data, for example the support chat conversations from your customer support application. This will help the model learn the specifics of how your company likes to summarize conversations and what is most useful to your customer service colleagues.
I know there's a lot to take in here, but don't worry, this example is going to be covered in the lab, so you'll get a chance to see this in action and try it out for yourself. One thing you need to think about when fine-tuning is how to evaluate the quality of your model's completions. In the next video, you'll learn about several metrics and benchmarks that you can use to determine how well your model is performing and how much better your fine-tuned version is than the original base model.