[LLM] Fine-Tuning Llama-3.1 8B with Unsloth: A Code Walkthrough
Unsloth is an open-source project for accelerating large-model training. It uses OpenAI's Triton to rewrite the model's compute kernels, which substantially speeds up training and reduces GPU memory usage during training.
- Unsloth GitHub project: https://github.com/unslothai/unsloth
- Official Colab notebook for fine-tuning Llama-3.1 8B with Unsloth: https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing#scrollTo=2eSvM9zX_2d3
- For installing Unsloth, see the blog post: [LLM] Unsloth Installation and Usage Guide
1. Loading the Model and Tokenizer
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",          # Phi-3 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
]  # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
The output is as follows:
[Code notes]:
(1) The model and tokenizer are loaded with unsloth's FastLanguageModel.from_pretrained(), which significantly speeds up loading.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)
(2) For comparison, here is the conventional way of loading the model and tokenizer with Hugging Face transformers:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_path = './model/llama-3-8b'  # local path to the model weights
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
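Note that this plain transformers snippet loads the model in full precision. To make the comparison closer to the Unsloth call above (which sets load_in_4bit = True), you would pass a quantization config; below is a minimal sketch assuming bitsandbytes is installed (the exact 4-bit settings Unsloth applies internally may differ):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization config (illustrative settings, not necessarily
# identical to what Unsloth configures internally).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_path,                      # same local path as above
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_path)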
2. LoRA Adapter
We now add LoRA adapters so we only need to update 1 to 10% of all parameters!
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
The output is as follows:
[Code notes]:
(1) The LoRA adapter is attached with unsloth's FastLanguageModel.get_peft_model(); the resulting model is later passed to SFTTrainer.
(2) For comparison, here is the conventional way of adding a LoRA adapter with peft's get_peft_model:
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",],
)
model = get_peft_model(model, lora_config)
# model.print_trainable_parameters()
In the conventional approach, the base Transformer model is wrapped by calling get_peft_model. You can then call model.print_trainable_parameters() to see how many parameters are trainable and what fraction of the total they represent (a drastic reduction compared with the full model).
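For example, a quick check might look like the sketch below; the printed numbers are illustrative (roughly what r = 16 over the seven projection modules of an 8B Llama model yields), not measured output:

model.print_trainable_parameters()
# Prints a single line of the form:
# trainable params: 41,943,040 || all params: ... || trainable%: ...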
3. Preparing the Data
- Dataset: this walkthrough uses yahma/alpaca-cleaned, a cleaned subset of 52K examples filtered from the original Alpaca dataset. Dataset page: https://huggingface.co/datasets/yahma/alpaca-cleaned
- Note: EOS_TOKEN must be appended to the tokenized output; otherwise generation will never terminate.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
{}
### Input:
{}
### Response:
{}"""
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass
from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
[Code notes]:
(1) The training set yahma/alpaca-cleaned follows the Alpaca prompt format, consisting of Instruction, Input, and Response fields.
(2) The formatting function formatting_prompts_func renders each training example into a prompt string that matches the format the pretrained model expects; the mapped dataset (with its "text" field) is what gets passed to SFTTrainer later.
Raw examples cannot be fed to the model directly: they are first turned into normalized strings by formatting_prompts_func and then tokenized before going into the model.
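A quick way to sanity-check the formatting (a small optional snippet, not part of the official notebook) is to print one rendered example and its token count:

# Inspect the first formatted training example and how many tokens it uses.
print(dataset[0]["text"])
print(len(tokenizer(dataset[0]["text"])["input_ids"]), "tokens")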
(3) The official repo also provides data-preparation and prompt templates for other task types:
- A llama-3 template for ShareGPT-style conversation tasks: https://colab.research.google.com/drive/1XamvWYinY6FOSX9GLvnqSjjsNflxdhNc?usp=sharing
- A mistral-7b template for text-completion tasks: https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing#scrollTo=QmUBVEnvCDJv
Common fine-tuning dataset formats fall into a few categories: instruction-following, multi-turn conversation, and other auxiliary formats (illustrative samples of the first two are sketched after this list).
- Instruction-following format: the user provides an instruction and the model produces an output that follows it. These datasets are usually stored as JSON files; the Alpaca-52k dataset is a typical example. Alpaca uses two layouts, instruction/output and instruction/input/output.
- Multi-turn conversation format: the user and the model interact over several turns until the user's need is met. A typical example is the ShareGPT dataset used to train the Vicuna model [6].
- Other auxiliary formats: some data, such as plain-text documents, does not convert naturally into a conversational form. There are also purpose-specific datasets, such as summarization datasets or datasets for generating dialogues from plain text.
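The samples below illustrate the first two formats as plain Python dicts (field names follow common conventions such as ShareGPT's conversations/from/value layout; concrete datasets may differ slightly):

# Alpaca-style instruction-following sample (instruction/input/output):
alpaca_sample = {
    "instruction": "Summarize the following paragraph.",
    "input": "Unsloth rewrites model internals with Triton kernels ...",
    "output": "Unsloth speeds up LLM fine-tuning and lowers VRAM usage.",
}

# ShareGPT-style multi-turn conversation sample:
sharegpt_sample = {
    "conversations": [
        {"from": "human", "value": "What does LoRA stand for?"},
        {"from": "gpt",   "value": "Low-Rank Adaptation."},
        {"from": "human", "value": "Why does it save memory?"},
        {"from": "gpt",   "value": "Only the small low-rank adapter matrices are trained."},
    ],
}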
For an introduction to prompts in LLMs, see the blog post: [NLP] LLM—大模型指令微调中的“Prompt”
4. Training the Model
4.1 Instantiating SFTTrainer
The model is trained with the SFTTrainer class from Hugging Face's TRL library. Official SFTTrainer documentation: https://huggingface.co/docs/trl/sft_trainer
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
The output is as follows:
[Code notes]:
(1) The model passed to SFTTrainer is the LoRA-wrapped model defined in the previous step, which has only a small number of trainable parameters; this is what makes the fine-tuning parameter-efficient.
4.2 Starting Training
trainer_stats = trainer.train()
The output is as follows:
4.3 GPU Memory Usage
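The snippet below uses two variables, start_gpu_memory and max_memory, that are not defined in this section. In the official notebook they are recorded in an earlier cell, before trainer.train() is called, roughly as follows (run it before training so the baseline reflects pre-training usage):

# Record GPU name, total memory, and reserved memory BEFORE training starts.
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")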
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
The output is as follows:
5. Model Inference
5.1 Direct Inference
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the Fibonacci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)
The output is as follows:
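tokenizer.batch_decode(outputs) returns the prompt together with the completion. If you only want the newly generated text, you can slice off the prompt tokens first; a small optional sketch (not part of the official notebook):

# Keep only the tokens generated after the prompt, then decode them.
prompt_len = inputs["input_ids"].shape[1]
generated_only = outputs[:, prompt_len:]
print(tokenizer.batch_decode(generated_only, skip_special_tokens=True)[0])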
5.2 Streaming Inference with TextStreamer
You can also use a TextStreamer for continuous inference - so you can see the generation token by token, instead of waiting the whole time!
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the Fibonacci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
The output is as follows:
6. Saving and Loading the LoRA Model
6.1 Saving the LoRA Adapter
- Local saving, to a local path:
model.save_pretrained("lora_model") # Local saving
tokenizer.save_pretrained("lora_model")
- Online saving, pushing to the Hugging Face Hub:
model.push_to_hub("your_name/lora_model", token = "...") # Online saving
tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving
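Only the adapter weights and their config are written out here, not the 8B base model, so the saved directory stays small. A quick way to verify (exact file names depend on your peft version, e.g. adapter_model.safetensors vs adapter_model.bin):

import os
print(os.listdir("lora_model"))
# Expect files such as adapter_config.json and adapter_model.safetensors,
# plus the tokenizer files saved alongside them.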
6.2 Loading the LoRA Adapter
The official notebook shows two ways:
- Method 1: Unsloth's FastLanguageModel
To load the LoRA adapter we just saved and run inference with it, change False to True:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference
# alpaca_prompt = You MUST copy from above!
inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")
from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
The output is as follows:
- Method 2: Hugging Face's AutoPeftModelForCausalLM
If unsloth is not installed, you can load the adapter with Hugging Face's AutoPeftModelForCausalLM instead. Only use this if you do not have unsloth installed: it can be hopelessly slow, since 4-bit model downloading is not supported, and Unsloth's inference is 2x faster.
if False:
    # I highly do NOT suggest - use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")
7. Saving to float16 for vLLM
We also support saving to float16 directly. Select merged_16bit for float16 or merged_4bit for int4. We also allow lora adapters as a fallback. Use push_to_hub_merged to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")
# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")
# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
8. GGUF / llama.cpp Conversion
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")
# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")
# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")
# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )
Now, use the model-unsloth.gguf or model-unsloth-Q4_K_M.gguf file in llama.cpp or a UI-based system like GPT4All. You can install GPT4All from its official website.
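If you prefer to test the exported GGUF from Python rather than the llama.cpp CLI or GPT4All, the llama-cpp-python package can load it; a minimal sketch, assuming llama-cpp-python is installed and using an example file name (match it to whatever save_pretrained_gguf actually produced):

from llama_cpp import Llama

# The file name is an example; point it at the GGUF file produced above.
llm = Llama(model_path="model-unsloth-Q4_K_M.gguf", n_ctx=2048)
prompt = alpaca_prompt.format("Continue the Fibonacci sequence.", "1, 1, 2, 3, 5, 8", "")
out = llm(prompt, max_tokens=64)
print(out["choices"][0]["text"])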
References
Official resources on fine-tuning Llama-3 with Unsloth:
- Unsloth GitHub project: https://github.com/unslothai/unsloth
- Official Colab notebook: https://colab.research.google.com/drive/1Ys44kVvmeZtnICzWz0xgpRnrIOjZAuxp?usp=sharing#scrollTo=2eSvM9zX_2d3