Training and Fine-Tuning Large Models with RLHF to Build Your Own GPT-4 (Part 1): Supervised Fine-Tuning (SFT)

Python怎么学啊 · Published 2024-07-12 09:33:09

Fine-tuning a large model involves three main stages:

Supervised fine-tuning (SFT).
Reward / preference modeling (RM).
Reinforcement learning from human feedback (RLHF).

The related code is available on GitHub: [github.com/night-is-yo…]

The code covers four models:

baichuan
chatglm3
qwen
yi

This article covers the first part: supervised fine-tuning.

The official SFT example: [github.com/huggingface…]

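from dataclasses import dataclass, field

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, HfArgumentParser, TrainingArguments
from trl import ModelConfig, SFTTrainer, get_kbit_device_map, get_peft_config, get_quantization_config


# Minimal stand-in for the example script's own argument dataclass (an assumption:
# only the two fields used below are declared, and the default is a placeholder).
@dataclass
class ScriptArguments:
    dataset_name: str = field(metadata={"help": "name of the dataset to load"})
    max_seq_length: int = field(default=512, metadata={"help": "length of each packed sequence"})


parser = HfArgumentParser((ScriptArguments, TrainingArguments, ModelConfig))
args, training_args, model_config = parser.parse_args_into_dataclasses()
# Use the non-reentrant variant of gradient checkpointing.
training_args.gradient_checkpointing_kwargs = dict(use_reentrant=False)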

################
# Model & Tokenizer
################
torch_dtype = (
    model_config.torch_dtype
    if model_config.torch_dtype in ["auto", None]
    else getattr(torch, model_config.torch_dtype)
)
quantization_config = get_quantization_config(model_config)
model_kwargs = dict(
    revision=model_config.model_revision,
    trust_remote_code=model_config.trust_remote_code,
    attn_implementation=model_config.attn_implementation,
    torch_dtype=torch_dtype,
    use_cache=False if training_args.gradient_checkpointing else True,
    device_map=get_kbit_device_map() if quantization_config is not None else None,
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_config.model_name_or_path, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

################
# Dataset
################
raw_datasets = load_dataset(args.dataset_name)
train_dataset = raw_datasets["train"]
eval_dataset = raw_datasets["test"]

################
# Training
################
trainer = SFTTrainer(
    model=model_config.model_name_or_path,
    model_init_kwargs=model_kwargs,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=args.max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    peft_config=get_peft_config(model_config),
)
trainer.train()
trainer.save_model(training_args.output_dir)

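This article does not recommend writing it this way: the model is passed as a name string, so loading happens inside SFTTrainer.

SFTTrainer source code walkthrough

Fine-tuning large models mainly relies on SFTTrainer, which makes several changes compared with the standard Trainer.

During initialization, SFTTrainer loads the model automatically when it receives a model ID string. It is still better to initialize the model yourself and pass the instance in.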

if isinstance(model, str):
    warnings.warn(
        "You passed a model_id to the SFTTrainer. This will automatically create an "
        "`AutoModelForCausalLM` or a `PeftModel` (if you passed a `peft_config`) for you."
    )
    model = AutoModelForCausalLM.from_pretrained(model, **model_init_kwargs)
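
Building the model yourself and handing the instance to SFTTrainer might look like the sketch below. The helper names `build_model_kwargs` and `load_model` are hypothetical, and the keyword arguments simply mirror the ones used in the official example above. `transformers` is imported lazily so that the kwargs helper can be inspected without loading it.

```python
from typing import Any, Dict, Optional


def build_model_kwargs(
    gradient_checkpointing: bool,
    torch_dtype: Optional[Any] = None,
    trust_remote_code: bool = False,
) -> Dict[str, Any]:
    """Collect from_pretrained kwargs (hypothetical helper, not part of TRL)."""
    return {
        "torch_dtype": torch_dtype,
        "trust_remote_code": trust_remote_code,
        # The KV cache only helps generation and is incompatible with
        # gradient checkpointing, so it is turned off when checkpointing is on.
        "use_cache": not gradient_checkpointing,
    }


def load_model(model_name_or_path: str, **model_kwargs: Any):
    """Load the causal LM explicitly instead of passing a string to SFTTrainer."""
    from transformers import AutoModelForCausalLM  # lazy import

    return AutoModelForCausalLM.from_pretrained(model_name_or_path, **model_kwargs)
```

The instance returned by `load_model` would then be passed as `model=` to SFTTrainer, and `model_init_kwargs` would be dropped, since that argument only applies when the model is given as a string.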

If a peft_config is passed, the PEFT model is initialized automatically:

if is_peft_available() and peft_config is not None:
    if not isinstance(peft_config, PeftConfig):
        raise ValueError(
            "If you want to use the PeftModel, you need to pass a PeftConfig object to the SFTTrainer."
            f" and you passed a {type(peft_config)}."
        )

    if not isinstance(model, PeftModel):
        _support_gc_kwargs = hasattr(
            args, "gradient_checkpointing_kwargs"
        ) and "gradient_checkpointing_kwargs" in list(
            inspect.signature(prepare_model_for_kbit_training).parameters
        )
        gradient_checkpointing_kwargs = getattr(args, "gradient_checkpointing_kwargs", None) or {}
        if getattr(model, "is_loaded_in_8bit", False) or getattr(model, "is_loaded_in_4bit", False):
            preprare_model_kwargs = {
                "use_gradient_checkpointing": getattr(args, "gradient_checkpointing", False)
            }

            if _support_gc_kwargs:
                preprare_model_kwargs["gradient_checkpointing_kwargs"] = gradient_checkpointing_kwargs

            model = prepare_model_for_kbit_training(model, **preprare_model_kwargs)

            if args is not None:
                args = dataclasses.replace(args, gradient_checkpointing=False)
        elif getattr(args, "gradient_checkpointing", False) and (
            "use_reentrant" not in gradient_checkpointing_kwargs
            or gradient_checkpointing_kwargs["use_reentrant"]
        ):
            # For backward compatibility with older versions of transformers
            if hasattr(model, "enable_input_require_grads"):
                model.enable_input_require_grads()
            else:

                def make_inputs_require_grad(module, input, output):
                    output.requires_grad_(True)

                model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)

        model = get_peft_model(model, peft_config)
        if args is not None and args.bf16 and getattr(model, "is_loaded_in_4bit", False):
            peft_module_casting_to_bf16(model)
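
The branching in this excerpt is easier to follow when reduced to a decision function. The sketch below is a simplified restatement, not TRL code: `peft_preparation_steps` is a hypothetical name, it assumes the model is not already a PeftModel, and it returns the names of the calls the excerpt would make rather than making them.

```python
from typing import List, Optional


def peft_preparation_steps(
    loaded_in_8bit: bool = False,
    loaded_in_4bit: bool = False,
    gradient_checkpointing: bool = False,
    gradient_checkpointing_kwargs: Optional[dict] = None,
    bf16: bool = False,
) -> List[str]:
    """Return, in order, the calls the excerpt above would make."""
    gc_kwargs = gradient_checkpointing_kwargs or {}
    steps = []
    if loaded_in_8bit or loaded_in_4bit:
        # Quantized models go through the k-bit preparation helper.
        steps.append("prepare_model_for_kbit_training")
    elif gradient_checkpointing and gc_kwargs.get("use_reentrant", True):
        # Reentrant checkpointing needs inputs that require grad
        # (done via the method or, on older versions, a forward hook).
        steps.append("enable_input_require_grads")
    steps.append("get_peft_model")
    if bf16 and loaded_in_4bit:
        steps.append("peft_module_casting_to_bf16")
    return steps


print(peft_preparation_steps(loaded_in_4bit=True, gradient_checkpointing=True, bf16=True))
print(peft_preparation_steps(gradient_checkpointing=True))
print(peft_preparation_steps(gradient_checkpointing=True,
                             gradient_checkpointing_kwargs={"use_reentrant": False}))
```

With `use_reentrant=False`, as set in the official example, a non-quantized model goes straight to `get_peft_model`.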

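Data loading is one of the trickier parts.

To use the data efficiently, we adopt a technique called packing. Instead of building each batch from samples that each hold a single text and padding them to the longest one, we concatenate many texts, separate them with an EOS token, and split the result into chunks that form the batch, which avoids padding.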
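
A rough way to see the benefit is to count tokens. The sketch below compares the two schemes on made-up text lengths; the function names are hypothetical, and it assumes one EOS token per text and that an incomplete final chunk is discarded.

```python
from typing import List


def padded_token_count(lengths: List[int]) -> int:
    """Tokens processed when every sample is padded to the longest one."""
    return max(lengths) * len(lengths)


def packed_token_count(lengths: List[int], seq_length: int) -> int:
    """Tokens processed when samples are joined with one EOS each and cut
    into full chunks of seq_length (an incomplete tail is discarded)."""
    total = sum(lengths) + len(lengths)  # one EOS token after each sample
    return (total // seq_length) * seq_length


lengths = [5, 12, 3, 40]  # made-up token counts for four texts
padded = padded_token_count(lengths)
print(f"padding: {padded} tokens processed, {padded - sum(lengths)} of them padding")
print(f"packing: {packed_token_count(lengths, seq_length=16)} tokens processed, none of them padding")
```

In this toy case padding processes 160 token positions of which 100 are padding, while packing processes 64 with no padding at all.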

ConstantLengthDataset implements packing. Its __iter__ method is shown below:

class ConstantLengthDataset(IterableDataset):
    def __iter__(self):
        iterator = iter(self.dataset)
        more_examples = True
        while more_examples:
            buffer, buffer_len = [], 0
            while True:
                if buffer_len >= self.max_buffer_size:
                    break
                try:
                    buffer.append(self.formatting_func(next(iterator)))
                    buffer_len += len(buffer[-1])
                except StopIteration:
                    if self.infinite:
                        iterator = iter(self.dataset)
                        warnings.warn("The dataset reached end and the iterator is reset to the start.")
                    else:
                        more_examples = False
                        break
            tokenized_inputs = self.tokenizer(buffer, add_special_tokens=self.add_special_tokens, truncation=False)[
                "input_ids"
            ]
            all_token_ids = []
            for tokenized_input in tokenized_inputs:
                if self.append_concat_token:
                    tokenized_input = tokenized_input + [self.concat_token_id]
                all_token_ids.extend(tokenized_input)
            examples = []
            for i in range(0, len(all_token_ids), self.seq_length):
                input_ids = all_token_ids[i : i + self.seq_length]
                if len(input_ids) == self.seq_length:
                    examples.append(input_ids)
            if self.shuffle:
                random.shuffle(examples)
            for example in examples:
                self.current_size += 1
                yield {
                    "input_ids": torch.LongTensor(example),
                    "labels": torch.LongTensor(example),
                }

1. Loading a large dataset into memory all at once could cause an out-of-memory error, so only part of the data is loaded at a time.

while more_examples:
    buffer, buffer_len = [], 0
    while True:
        if buffer_len >= self.max_buffer_size:
            break
        try:
            buffer.append(self.formatting_func(next(iterator)))
            buffer_len += len(buffer[-1])
        except StopIteration:
            if self.infinite:
                iterator = iter(self.dataset)
                warnings.warn("The dataset reached end and the iterator is reset to the start.")
            else:
                more_examples = False
                break

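The outer while loop above makes sure the whole dataset is consumed, and the inner while loop loads it one buffer at a time. The buffer size is set in the initializer:

self.max_buffer_size = seq_length * chars_per_token * num_of_sequences

Here chars_per_token is the estimated average number of characters per token. It converts the token budget (seq_length * num_of_sequences) into a character count, because buffer_len is measured in characters of raw text, before tokenization.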
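
The buffering step can be sketched in plain Python as follows. This is a simplified illustration rather than the TRL implementation: `iter_buffers` is a hypothetical name, the `infinite` restart is omitted, and the numbers in the usage lines are illustrative values only.

```python
from typing import Iterable, Iterator, List


def max_buffer_size(seq_length: int, chars_per_token: float, num_of_sequences: int) -> float:
    """Character budget of one buffer, using the same formula as the initializer."""
    return seq_length * chars_per_token * num_of_sequences


def iter_buffers(texts: Iterable[str], budget: float) -> Iterator[List[str]]:
    """Yield lists of texts whose total character count reaches the budget."""
    buffer, buffer_len = [], 0
    for text in texts:
        buffer.append(text)
        buffer_len += len(text)  # measured in characters, not tokens
        if buffer_len >= budget:
            yield buffer
            buffer, buffer_len = [], 0
    if buffer:  # last, partially filled buffer
        yield buffer


# Illustrative values: 1024-token sequences, ~3.6 characters per token,
# 1024 sequences per buffer -> roughly 3.77 million characters per buffer.
print(max_buffer_size(1024, 3.6, 1024))

# Tiny budget so the grouping is visible.
print(list(iter_buffers(["aaaa", "bb", "cccccc", "d"], budget=6)))
```

Only one buffer of raw text is held in memory at a time, which is what keeps memory use bounded regardless of dataset size.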

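2. Tokenize the buffered texts and concatenate all the token IDs into one list, appending the separator token after each text.

tokenized_inputs = self.tokenizer(buffer, add_special_tokens=self.add_special_tokens, truncation=False)[
    "input_ids"
]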
all_token_ids = []
for tokenized_input in tokenized_inputs:
    if self.append_concat_token:
        tokenized_input = tokenized_input + [self.concat_token_id]
    all_token_ids.extend(tokenized_input)

3. Split the concatenated data into chunks of seq_length tokens. A final chunk shorter than seq_length is discarded.

examples = []
for i in range(0, len(all_token_ids), self.seq_length):
    input_ids = all_token_ids[i : i + self.seq_length]
    if len(input_ids) == self.seq_length:
        examples.append(input_ids)
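
Steps 2 and 3 can be tried end to end on toy data. The sketch below follows the same logic as the excerpts but takes already tokenized inputs, so no tokenizer is needed; `pack_token_ids` is a hypothetical name and the token IDs are made up.

```python
from typing import List


def pack_token_ids(
    tokenized_inputs: List[List[int]], seq_length: int, concat_token_id: int
) -> List[List[int]]:
    """Join token-ID lists with a separator, then cut into full chunks."""
    all_token_ids: List[int] = []
    for token_ids in tokenized_inputs:
        all_token_ids.extend(token_ids + [concat_token_id])
    chunks = []
    for i in range(0, len(all_token_ids), seq_length):
        input_ids = all_token_ids[i : i + seq_length]
        if len(input_ids) == seq_length:  # drop the incomplete tail
            chunks.append(input_ids)
    return chunks


# Three made-up "tokenized" texts; 0 plays the role of the EOS/concat token.
chunks = pack_token_ids(
    [[11, 12, 13], [21, 22], [31, 32, 33, 34, 35]], seq_length=4, concat_token_id=0
)
print(chunks)  # [[11, 12, 13, 0], [21, 22, 0, 31], [32, 33, 34, 35]]
```

The 13 concatenated tokens yield three full chunks and the last token is dropped. A chunk can also start or end in the middle of a text, as the second and third chunks do here.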
