How do you instruction-tune a large model without a labeled dataset? Introducing a promising model for generating labeled datasets

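When building LLM applications, there are usually two ways to improve results. One is to build an external knowledge base and rely on RAG. But RAG is not a cure-all: domain-specific LLM applications, and cases where the model must complete a particular task without in-context examples, call for the other way, fine-tuning. Compared with RAG, fine-tuning needs more compute and a longer turnaround, but the bigger bottleneck is that it requires labeled sample data. Many enterprises have not accumulated such high-quality data. Their data is usually unlabeled and may consist of individual articles or rules and regulations.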

程序员笑武 · Published 2024-06-29 20:24:56

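To fine-tune, the traditional approach is to construct question-answer pairs by hand. Building on this, a Stanford research team proposed Alpaca, which has a strong model imitate seed examples to generate a labeled dataset (text-davinci-003 in the original Alpaca work; later follow-ups used GPT-4).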

Bonito paper: https://arxiv.org/pdf/2402.18334

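This article introduces Bonito (https://github.com/BatsResearch/bonito), a newer project for generating sample data. It is an open-source model for conditional task generation that converts unannotated text into task-specific training datasets for instruction tuning. According to the paper, the model is fine-tuned from mistralai/Mistral-7B-v0.1 on a dataset of 1.65 million examples (https://huggingface.co/datasets/BatsResearch/ctga-v1) and supports multiple task types, including multiple-choice question answering, yes-no question answering, natural language inference, and topic classification.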

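The Bonito project itself is an LLM application for data generation. The model is accelerated with vLLM and is simple to use. The basic flow is to extract the document content (for example from PDFs) into a `datasets` Dataset, choose the task type to generate, and pass both to bonito.generate_tasks.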
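The extraction step can be sketched as follows. This is a minimal illustration rather than part of Bonito: `split_into_paragraphs`, `min_chars`, and `max_chars` are hypothetical names, and the text is assumed to have already been pulled out of the PDF with blank lines between paragraphs.

```python
import re


def split_into_paragraphs(text, min_chars=50, max_chars=2000):
    """Split raw document text into paragraph-sized context rows."""
    rows = []
    # Assumption: paragraphs are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text):
        paragraph = " ".join(block.split())  # collapse internal whitespace
        if len(paragraph) < min_chars:
            continue  # skip headings, page numbers, and other short fragments
        # Hard-wrap overly long paragraphs so each row stays within max_chars.
        for start in range(0, len(paragraph), max_chars):
            rows.append({"input": paragraph[start:start + max_chars]})
    return rows


document = (
    "1. Confidential Information shall mean any data, document, or material "
    "disclosed to the Recipient.\n\n"
    "Page 3\n\n"
    "2. The Recipient shall not disclose Confidential Information to any "
    "third party without prior written consent."
)
rows = split_into_paragraphs(document)
print(len(rows))
```

The resulting rows could then be wrapped with `Dataset.from_list(rows)` from the `datasets` library and passed to the model with `context_col="input"`.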

Definition of the Bonito class:

```python
# Excerpt from the Bonito library. Dataset comes from `datasets`;
# LLM and SamplingParams come from `vllm`.
class Bonito(LLM, AbstractBonito):
    def generate_tasks(
        self,
        text_dataset: Dataset,
        context_col: str,
        task_type: str,
        sampling_params: SamplingParams,
        **kwargs,
    ):
        """
        Generates tasks using the Bonito model.

        This method takes a text dataset, a context column name,
        a task type, and sampling parameters, and generates tasks
        using the Bonito model. It processes the input dataset,
        generates outputs, collects multiple generations into
        one dataset object, and filters out the examples that
        cannot be parsed.

        Args:
            text_dataset (Dataset): The dataset that provides the text
                for the tasks.
            context_col (str): The name of the column in the dataset
                that provides the context for the tasks.
            task_type (str): The type of the tasks. This can be a
                short form or a full form.
            sampling_params (SamplingParams): The parameters for
                sampling.
            **kwargs: Additional keyword arguments.

        Returns:
            Dataset: The synthetic dataset with the generated tasks.
        """
        processed_dataset = self._prepare_bonito_input(
            text_dataset, task_type, context_col, **kwargs
        )
        outputs = self.generate(processed_dataset["input"], sampling_params)

        # collect multiple generations into one dataset object
        examples = []
        for i, example in enumerate(text_dataset.to_list()):
            for output in outputs[i].outputs:
                examples.append(
                    {"context": example[context_col], "prediction": output.text.strip()}
                )

        synthetic_dataset = Dataset.from_list(examples)

        # filter out the examples that cannot be parsed
        synthetic_dataset = self._postprocess_dataset(
            synthetic_dataset, context_col="context", **kwargs
        )

        return synthetic_dataset
```

Basic usage:

```python
from bonito import Bonito
from vllm import SamplingParams
from datasets import load_dataset

# Initialize the Bonito model
bonito = Bonito("BatsResearch/bonito-v1")

# load dataset with unannotated text
unannotated_text = load_dataset(
    "BatsResearch/bonito-experiment",
    "unannotated_contract_nli"
)["train"].select(range(10))

# Generate synthetic instruction tuning dataset
sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic_dataset = bonito.generate_tasks(
    unannotated_text,
    context_col="input",
    task_type="nli",
    sampling_params=sampling_params
)
```

To run on a GPU with less memory, such as a T4, the model can be quantized.

```python
from typing import Optional, List, Dict
from datasets import Dataset
from awq import AutoAWQForCausalLM
from bonito import AbstractBonito
from transformers import AutoTokenizer


class QuantizedBonito(AbstractBonito):
    def __init__(self, model_name_or_path):
        self.model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True).cuda()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    def generate_task(
        self,
        unannotated_paragraph: str,
        task_type: str,
        sampling_params: dict,
    ) -> Dict:
        """
        Generates synthetic instruction tuning pair using the Quantized Bonito model.
        This method takes a text unannotated text, a task type, and sampling parameters,
        and generates synthetic input-output pair.

        Args:
            unannotated_paragraph (str): The unannotated text or a paragraph
            task_type (str): The type of the tasks. This can be a
                short form or a full form.
            sampling_params (dict): The parameters for
                sampling.
            **kwargs: Additional keyword arguments.

        Returns:
            Dict: The synthetic input-output pair for the task type.
        """

        text_dataset = Dataset.from_list([{"input": unannotated_paragraph}])

        processed_dataset = self._prepare_bonito_input(
            text_dataset, task_type, context_col="input"
        )

        outputs = self._generate_text(processed_dataset["input"], sampling_params)
        examples = []
        for i, example in enumerate(text_dataset.to_list()):
            output = outputs[i]
            example["prediction"] = output.strip()
            examples.append(example)

        synthetic_dataset = Dataset.from_list(examples)

        # filter out the examples that cannot be parsed
        synthetic_dataset_dict = self._postprocess_dataset(
            synthetic_dataset, context_col="input"
        ).to_list()[0]

        return synthetic_dataset_dict

    def _generate_text(
        self,
        dataset: Dataset,
        sampling_params: dict,
    ) -> List[str]:
        """
        Generate text using huggingface transformers generate function.

        This method takes a dataset of prompts, encodes them,
        generates text using the model, decodes the generated
        text, and appends it to a list.

        Args:
            dataset (Dataset): A dataset containing prompts for text generation.
            sampling_params (dict): Parameters for sampling during generation.

        Returns:
            List[str]: A list of generated texts corresponding to the prompts.
        """
        generated_texts = []

        for prompt in dataset:
            input_ids = self.tokenizer.encode(prompt, return_tensors="pt")
            input_ids = input_ids.cuda()

            output = self.model.generate(
                input_ids,
                do_sample=True,
                **sampling_params
            )

            generated_text = self.tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
            generated_texts.append(generated_text)

        return generated_texts
```
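A rough back-of-envelope estimate shows why quantization matters on a T4, which offers about 16 GB of memory. The sketch below counts weight storage only and assumes a round 7 billion parameters and 4-bit quantized weights (the usual AWQ setting). It ignores activations, the KV cache, and quantization metadata, so real usage will be higher. `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory needed to hold the model weights, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


NUM_PARAMS = 7e9  # assumption: round figure for a 7B-parameter model

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)  # half-precision weights
int4_gib = weight_memory_gib(NUM_PARAMS, 4)   # 4-bit quantized weights
print(f"fp16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {int4_gib:.1f} GiB")
```

Under these assumptions the half-precision weights alone come to about 13 GiB, leaving little headroom on a 16 GB card, while the 4-bit weights need just over 3 GiB.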

Taking task_type ynqa (yes-no question answering) as an example, the generated result looks like this:

```python
from pprint import pprint

# Assumes `bonito` is a QuantizedBonito instance and `unannotated_paragraph`
# holds the source passage as a string.

# Generate synthetic instruction tuning dataset
sampling_params = {'max_new_tokens':256, 'top_p':0.95, 'temperature':0.7, 'num_return_sequences':1}
synthetic_dataset = bonito.generate_task(
    unannotated_paragraph,
    task_type="ynqa",
    sampling_params=sampling_params
)
pprint("----Generated Instructions----")
pprint(f'Input: {synthetic_dataset["input"]}')
pprint(f'Output: {synthetic_dataset["output"]}')
```

Output:

```
'----Generated Instructions----'
('Input: Based on the following passage, is a written communication '
 'confidential? 1. “Confidential Information”, whenever used in this '
 'Agreement, shall mean any data, document, specification and other '
 'information or material, that is delivered or disclosed by UNHCR to the '
 'Recipient in any form whatsoever, whether orally, visually in writing or '
 'otherwise (including computerized form), and that, at the time of disclosure '
 'to the Recipient, is designated as confidential.')
'Output: Yes'
```
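Pairs like the one above still need to be saved in a form a fine-tuning pipeline can read. The sketch below writes them as JSON Lines. The field names `instruction` and `response` and the helper `to_jsonl` are assumptions chosen for illustration, not a format prescribed by Bonito.

```python
import json


def to_jsonl(pairs):
    """Serialize input/output pairs as JSON Lines, one training record per line."""
    lines = []
    for pair in pairs:
        # Hypothetical field names; rename them to match the training framework.
        record = {"instruction": pair["input"], "response": pair["output"]}
        # ensure_ascii=False keeps non-ASCII text (e.g. curly quotes) readable.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


pairs = [
    {
        "input": "Based on the following passage, is a written communication "
                 "confidential? “Confidential Information” covers material "
                 "disclosed in any form, including in writing.",
        "output": "Yes",
    }
]
jsonl_text = to_jsonl(pairs)
print(jsonl_text, end="")
```

Keeping one record per line makes it easy to append new generations and to stream the file during training.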

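The supported values of task_type are as follows:

Extractive question answering (exqa): answer a question about a given passage by extracting the answer span directly from the text.
Multiple-choice question answering (mcqa): answer a question by choosing from a set of candidate options.
Question generation (qg): create questions from the provided text.
Question answering without choices (qa): answer a question when no multiple-choice options are provided.
Yes-no question answering (ynqa): answer a question with yes or no.
Coreference resolution (coref): identify expressions in the text that refer to the same entity.
Paraphrase generation (paraphrase): rewrite a sentence or phrase with different wording while preserving its meaning.
Paraphrase identification (paraphrase_id): determine whether two sentences or phrases convey the same meaning.
Sentence completion (sent_comp): fill in the missing part of a sentence.
Sentiment analysis (sentiment): identify the sentiment expressed in a text, such as positive, negative, or neutral.
Summarization (summarization): condense a longer text into a shorter summary that captures the main points.
Text generation (text_gen): produce coherent, contextually relevant text from a prompt.
Topic classification (topic_class): assign a text to one of a set of predefined topics.
Word sense disambiguation (wsd): determine the meaning of a word from its context.
Textual entailment (te): predict whether one text logically follows from another.
Natural language inference (nli): determine the relationship between two texts: contradiction, entailment, or neutral.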
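Since the task type can be given in short or full form, a small lookup table helps validate it before launching a generation run. The sketch below is illustrative only: `TASK_TYPES` and `normalize_task_type` are hypothetical names, and the full-form strings are English renderings of the list above rather than values confirmed against the library.

```python
# Short form -> full name, following the list above (full names are assumed).
TASK_TYPES = {
    "exqa": "extractive question answering",
    "mcqa": "multiple-choice question answering",
    "qg": "question generation",
    "qa": "question answering without choices",
    "ynqa": "yes-no question answering",
    "coref": "coreference resolution",
    "paraphrase": "paraphrase generation",
    "paraphrase_id": "paraphrase identification",
    "sent_comp": "sentence completion",
    "sentiment": "sentiment analysis",
    "summarization": "summarization",
    "text_gen": "text generation",
    "topic_class": "topic classification",
    "wsd": "word sense disambiguation",
    "te": "textual entailment",
    "nli": "natural language inference",
}


def normalize_task_type(task_type):
    """Map a short or full task name to its short form, or raise ValueError."""
    key = task_type.strip().lower()
    if key in TASK_TYPES:
        return key
    for short, full in TASK_TYPES.items():
        if key == full:
            return short
    raise ValueError(f"Unsupported task type: {task_type!r}")


print(normalize_task_type("ynqa"))
print(normalize_task_type("Natural Language Inference"))
```

Failing early on an unknown name is cheaper than discovering the mistake after the model has already been loaded onto the GPU.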

In terms of performance, compared with the GPT-4-based approach, Bonito outperformed GPT-4 on two of the three datasets evaluated.

Summary:

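Compared with using GPT-4 to generate labeled samples, Bonito, a model fine-tuned specifically for dataset generation, supports zero-shot sample generation and is itself open source, which gives it clear advantages in openness, cost, and performance.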

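As fine-tuning becomes more widespread, the quality and production cost of training samples will likely draw increasing attention, and dataset-generation models such as Bonito should see further development.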

