Parameter Efficient Fine-Tuning (PEFT) Overview

As you saw in the first week of the course, training LLMs is computationally intensive. Full fine-tuning requires memory not just to store the model, but also for various other parameters that are required during the training process. Even if your computer can hold the model weights, which are now on the order of hundreds of gigabytes for the largest models, you must also be able to allocate memory for optimizer states, gradients, forward activations, and temporary memory throughout the training process. These additional components can be many times larger than the model itself and can quickly become too large to handle on consumer hardware.

In contrast to full fine-tuning, where every model weight is updated during supervised learning, parameter efficient fine-tuning methods only update a small subset of parameters. Some PEFT techniques freeze most of the model weights and focus on fine-tuning a subset of existing model parameters, for example, particular layers or components. Other techniques don't touch the original model weights at all, and instead add a small number of new parameters or layers and fine-tune only the new components. With PEFT, most if not all of the LLM weights are kept frozen. As a result, the number of trained parameters is much smaller than the number of parameters in the original LLM, in some cases just 15-20% of the original LLM weights. This makes the memory requirements for training much more manageable. In fact, PEFT can often be performed on a single GPU. And because the original LLM is only slightly modified or left unchanged, PEFT is less prone to the catastrophic forgetting problems of full fine-tuning. Full fine-tuning results in a new version of the model for every task you train on. Each of these is the same size as the original model, so it can create an expensive storage problem if you're fine-tuning for multiple tasks.
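
To make the idea concrete, here is a minimal sketch in plain PyTorch/Transformers (the base model google/flan-t5-base is just an illustrative choice): freeze every original weight, so that whatever a PEFT method later adds or unfreezes is the only thing the optimizer ever updates.

```python
from transformers import AutoModelForSeq2SeqLM

# Illustrative base model; any Transformers checkpoint works the same way.
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Freeze all of the original LLM weights.
for param in model.parameters():
    param.requires_grad = False

# A PEFT method would now add new trainable parameters or unfreeze a small subset.

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,} ({100 * trainable / total:.2f}%)")
```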

Let's see how you can use PEFT to improve the situation. With parameter efficient fine-tuning, you train only a small number of weights, which results in a much smaller footprint overall, as small as megabytes depending on the task. The new parameters are combined with the original LLM weights for inference. The PEFT weights are trained for each task and can be easily swapped out for inference, allowing efficient adaptation of the original model to multiple tasks. There are several methods you can use for parameter efficient fine-tuning, each with trade-offs on parameter efficiency, memory efficiency, training speed, model quality, and inference costs.
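
As a rough sketch of that per-task workflow, the Hugging Face peft library can be used as below. The adapter path and the LoRA settings are illustrative assumptions, not values taken from the lesson.

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model, PeftModel

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Wrap the frozen base model with a small set of trainable PEFT weights (LoRA here).
config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, r=8, lora_alpha=32, lora_dropout=0.05)
peft_model = get_peft_model(base, config)
peft_model.print_trainable_parameters()  # typically well under 1% of the base model

# ...fine-tune peft_model on task A, then save only the adapter weights (a few MB)...
peft_model.save_pretrained("adapters/task_a")

# At inference time, load the frozen base once and attach whichever task's adapter
# you need; swapping adapters is far cheaper than storing a full model copy per task.
base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
task_a_model = PeftModel.from_pretrained(base, "adapters/task_a")
```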

Let's take a look at the three main classes of PEFT methods. Selective methods are those that fine-tune only a subset of the original LLM parameters. There are several approaches that you can take to identify which parameters you want to update. You have the option to train only certain components of the model or specific layers, or even individual parameter types. Researchers have found that the performance of these methods is mixed and there are significant trade-offs between parameter efficiency and compute efficiency. We won't focus on them in this course.
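
For illustration only (the layer chosen below is a hypothetical example, not a recommendation from the course), a selective method boils down to unfreezing an existing subset of parameters and training nothing else:

```python
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

for name, param in model.named_parameters():
    # Train only the final decoder block (an arbitrary example of "specific layers");
    # every other original parameter stays frozen.
    param.requires_grad = name.startswith("decoder.block.11.")
```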

Reparameterization methods also work with the original LLM parameters, but reduce the number of parameters to train by creating new low rank transformations of the original network weights. A commonly used technique of this type is LoRA, which we'll explore in detail in the next video. Lastly, additive methods carry out fine-tuning by keeping all of the original LLM weights frozen and introducing new trainable components. Here there are two main approaches. Adapter methods add new trainable layers to the architecture of the model, typically inside the encoder or decoder components after the attention or feed-forward layers. Soft prompt methods, on the other hand, keep the model architecture fixed and frozen, and focus on manipulating the input to achieve better performance. This can be done by adding trainable parameters to the prompt embeddings or keeping the input fixed and retraining the embedding weights.
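
As a conceptual sketch in plain PyTorch (the dimensions and rank are made up, and this is not the full LoRA implementation covered in the next video), the reparameterization and soft prompt ideas look like this:

```python
import torch
import torch.nn as nn

d_model, rank, n_virtual_tokens = 512, 8, 20  # illustrative sizes

# Reparameterization (LoRA-style): the original weight W stays frozen; only the
# low-rank factors A and B are trained, so the effective weight is W + B @ A.
W = torch.randn(d_model, d_model)              # frozen original weight
A = nn.Parameter(torch.randn(rank, d_model))   # trainable, low rank
B = nn.Parameter(torch.zeros(d_model, rank))   # trainable, low rank
effective_W = W + B @ A

# Soft prompts: the model stays frozen; a small set of trainable "virtual token"
# embeddings is prepended to the embeddings of the real input tokens.
soft_prompt = nn.Parameter(torch.randn(n_virtual_tokens, d_model))
input_embeddings = torch.randn(1, 16, d_model)  # embeddings of an example prompt
augmented_input = torch.cat([soft_prompt.unsqueeze(0), input_embeddings], dim=1)
```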

In this lesson, you'll take a look at a specific soft prompt technique called prompt tuning. But first, let's move on to the next video and take a closer look at the LoRA method to see how it reduces the memory required for training.
