


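This post introduces Bonito (https://github.com/BatsResearch/bonito), a new project for generating training samples. Bonito is an open-source model for conditional task generation: it converts unannotated text into task-specific training datasets for instruction tuning. According to the paper, the model is fine-tuned from mistralai/Mistral-7B-v0.1 on a dataset of 1.65 million examples (https://huggingface.co/datasets/BatsResearch/ctga-v1). It supports a range of task types, including multiple-choice question answering, yes-no question answering, natural language inference, and topic classification.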



```python
class Bonito(LLM, AbstractBonito):
    def generate_tasks(
        self,
        text_dataset: Dataset,
        context_col: str,
        task_type: str,
        sampling_params: SamplingParams,
        **kwargs,
    ):
        """
        Generates tasks using the Bonito model.

        This method takes a text dataset, a context column name,
        a task type, and sampling parameters, and generates tasks
        using the Bonito model. It processes the input dataset,
        generates outputs, collects multiple generations into
        one dataset object, and filters out the examples that
        cannot be parsed.

        Args:
            text_dataset (Dataset): The dataset that provides the text
                for the tasks.
            context_col (str): The name of the column in the dataset
                that provides the context for the tasks.
            task_type (str): The type of the tasks. This can be a
                short form or a full form.
            sampling_params (SamplingParams): The parameters for
                sampling.
            **kwargs: Additional keyword arguments.

        Returns:
            Dataset: The synthetic dataset with the generated tasks.
        """
        processed_dataset = self._prepare_bonito_input(
            text_dataset, task_type, context_col, **kwargs
        )
        outputs = self.generate(processed_dataset["input"], sampling_params)

        # collect multiple generations into one dataset object
        examples = []
        for i, example in enumerate(text_dataset.to_list()):
            for output in outputs[i].outputs:
                examples.append(
                    {"context": example[context_col], "prediction": output.text.strip()}
                )

        synthetic_dataset = Dataset.from_list(examples)

        # filter out the examples that cannot be parsed
        synthetic_dataset = self._postprocess_dataset(
            synthetic_dataset, context_col="context", **kwargs
        )

        return synthetic_dataset
```
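The docstring above says that generations which cannot be parsed are filtered out, but the excerpt does not show how. The following is a minimal sketch of one way such a step could work, assuming each generation holds an input template and an answer separated by a delimiter. The delimiter, the context placeholder, and the function names `parse_prediction` and `postprocess` are assumptions made for illustration; they are not taken from the excerpt.

```python
from typing import Dict, List, Optional

# Assumed markers, for illustration only; the excerpt above does not name them.
DELIMITER = "<|pipe|>"
CONTEXT_PLACEHOLDER = "{{context}}"


def parse_prediction(context: str, prediction: str) -> Optional[Dict[str, str]]:
    """Split one raw generation into an input/output pair, or return None."""
    if prediction.count(DELIMITER) != 1:
        return None  # cannot be parsed: delimiter is missing or repeated
    template, answer = (part.strip() for part in prediction.split(DELIMITER))
    if not template or not answer:
        return None  # one side of the pair is empty
    return {
        "input": template.replace(CONTEXT_PLACEHOLDER, context),
        "output": answer,
    }


def postprocess(examples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the examples whose prediction parses cleanly."""
    parsed = (parse_prediction(e["context"], e["prediction"]) for e in examples)
    return [p for p in parsed if p is not None]
```

Under these assumptions, a generation without the delimiter is dropped, which is why the synthetic dataset can end up with fewer rows than the number of generations requested.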


```python
from bonito import Bonito
from vllm import SamplingParams
from datasets import load_dataset

# Initialize the Bonito model
bonito = Bonito("BatsResearch/bonito-v1")

# load dataset with unannotated text
unannotated_text = load_dataset(
    "BatsResearch/bonito-experiment",
    "unannotated_contract_nli"
)["train"].select(range(10))

# Generate synthetic instruction tuning dataset
sampling_params = SamplingParams(max_tokens=256, top_p=0.95, temperature=0.5, n=1)
synthetic_dataset = bonito.generate_tasks(
    unannotated_text,
    context_col="input",
    task_type="nli",
    sampling_params=sampling_params
)
```


```python
from typing import Optional, List, Dict
from datasets import Dataset
from awq import AutoAWQForCausalLM
from bonito import AbstractBonito
from transformers import AutoTokenizer


class QuantizedBonito(AbstractBonito):
    def __init__(self, model_name_or_path):
        self.model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True).cuda()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    def generate_task(
        self,
        unannotated_paragraph: str,
        task_type: str,
        sampling_params: dict,
    ) -> Dict:
        """
        Generates synthetic instruction tuning pair using the Quantized Bonito model.
        This method takes an unannotated text, a task type, and sampling parameters,
        and generates synthetic input-output pair.

        Args:
            unannotated_paragraph (str): The unannotated text or a paragraph
            task_type (str): The type of the tasks. This can be a
                short form or a full form.
            sampling_params (dict): The parameters for
                sampling.

        Returns:
            Dict: The synthetic input-output pair for the task type.
        """

        text_dataset = Dataset.from_list([{"input": unannotated_paragraph}])

        processed_dataset = self._prepare_bonito_input(
            text_dataset, task_type, context_col="input"
        )

        outputs = self._generate_text(processed_dataset["input"], sampling_params)
        examples = []
        for i, example in enumerate(text_dataset.to_list()):
            output = outputs[i]
            example["prediction"] = output.strip()
            examples.append(example)

        synthetic_dataset = Dataset.from_list(examples)

        # filter out the examples that cannot be parsed
        synthetic_dataset_dict = self._postprocess_dataset(
            synthetic_dataset, context_col="input"
        ).to_list()[0]

        return synthetic_dataset_dict

    def _generate_text(
        self,
        dataset: Dataset,
        sampling_params: dict,
    ) -> List[str]:
        """
        Generate text using huggingface transformers generate function.

        This method takes a dataset of prompts, encodes them,
        generates text using the model, decodes the generated
        text, and appends it to a list.

        Args:
            dataset (Dataset): A dataset containing prompts for text generation.
            sampling_params (dict): Parameters for sampling during generation.

        Returns:
            List[str]: A list of generated texts corresponding to the prompts.
        """
        generated_texts = []

        for prompt in dataset:
            input_ids = self.tokenizer.encode(prompt, return_tensors="pt")
            input_ids = input_ids.cuda()

            output = self.model.generate(
                input_ids,
                do_sample=True,
                **sampling_params
            )

            # decode only the newly generated tokens, skipping the prompt
            generated_text = self.tokenizer.decode(output[0][len(input_ids[0]):], skip_special_tokens=True)
            generated_texts.append(generated_text)

        return generated_texts
```


```python
from pprint import pprint

# Assumes `bonito` is a QuantizedBonito instance and `unannotated_paragraph`
# holds the passage to convert.

# Generate synthetic instruction tuning dataset
sampling_params = {'max_new_tokens':256, 'top_p':0.95, 'temperature':0.7, 'num_return_sequences':1}
synthetic_dataset = bonito.generate_task(
    unannotated_paragraph,
    task_type="ynqa",
    sampling_params=sampling_params
)
pprint("----Generated Instructions----")
pprint(f'Input: {synthetic_dataset["input"]}')
pprint(f'Output: {synthetic_dataset["output"]}')
```

Output:

```
'----Generated Instructions----'
('Input: Based on the following passage, is a written communication '
 'confidential? 1. “Confidential Information”, whenever used in this '
 'Agreement, shall mean any data, document, specification and other '
 'information or material, that is delivered or disclosed by UNHCR to the '
 'Recipient in any form whatsoever, whether orally, visually in writing or '
 'otherwise (including computerized form), and that, at the time of disclosure '
 'to the Recipient, is designated as confidential.')
'Output: Yes'
```
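The two wrappers take their sampling settings in different shapes: the vLLM-based `Bonito` receives a `SamplingParams` object built with `max_tokens` and `n`, while `QuantizedBonito` receives a plain dict with `max_new_tokens` and `num_return_sequences` that is unpacked into the model's `generate` call. As a rough sketch, the renaming between the two can be written as a small helper. The name `to_hf_generate_kwargs` is hypothetical, and only the keys that appear in the examples in this post are covered.

```python
from typing import Any, Dict

# Keys that are spelled differently in the two examples in this post.
# Keys not listed here (top_p, temperature) keep the same name.
VLLM_TO_HF = {
    "max_tokens": "max_new_tokens",
    "n": "num_return_sequences",
}


def to_hf_generate_kwargs(vllm_style: Dict[str, Any]) -> Dict[str, Any]:
    """Rename vLLM-style sampling keys to the dict keys used by QuantizedBonito."""
    return {VLLM_TO_HF.get(key, key): value for key, value in vllm_style.items()}


hf_kwargs = to_hf_generate_kwargs(
    {"max_tokens": 256, "top_p": 0.95, "temperature": 0.7, "n": 1}
)
print(hf_kwargs)
```

Keeping one set of settings and converting it this way makes it easier to compare the full-precision and quantized models under the same sampling configuration.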


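Bonito supports the following task types (short form in parentheses):

  1. Extractive question answering (exqa): answers a question about a given passage by extracting the answer directly from the text.

  2. Multiple-choice question answering (mcqa): answers a question by selecting from a set of candidate options.

  3. Question generation (qg): creates questions from the provided text.

  4. Question answering without choices (qa): answers a question without being given multiple-choice options.

  5. Yes-no question answering (ynqa): answers a question with yes or no.

  6. Coreference resolution (coref): identifies expressions in the text that refer to the same entity.

  7. Paraphrase generation (paraphrase): rewrites a sentence or phrase in different words while preserving its meaning.

  8. Paraphrase identification (paraphrase_id): determines whether two sentences or phrases convey the same meaning.

  9. Sentence completion (sent_comp): fills in the missing part of a sentence.

  10. Sentiment analysis (sentiment): identifies the sentiment expressed in the text, such as positive, negative, or neutral.

  11. Summarization (summarization): condenses a longer text into a shorter summary that captures the key points.

  12. Text generation (text_gen): produces coherent, contextually relevant text from a prompt.

  13. Topic classification (topic_class): assigns the text to one of a set of predefined topics.

  14. Word sense disambiguation (wsd): determines the meaning of a word from its context.

  15. Textual entailment (te): predicts whether one text logically follows from another.

  16. Natural language inference (nli): determines the relationship between two texts, such as contradiction, entailment, or neutral.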
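Since the `task_type` argument accepts either a short form or a full form, it can help to validate the value before starting a long generation run. The sketch below keeps the sixteen short forms from the list in a lookup table. The full-form strings are plain English renderings of the list and are assumed rather than taken from the library, and `resolve_task_type` is a hypothetical helper name.

```python
# Short forms come from the list above; full-form strings are assumed renderings.
SHORT_TO_FULL = {
    "exqa": "extractive question answering",
    "mcqa": "multiple-choice question answering",
    "qg": "question generation",
    "qa": "question answering without choices",
    "ynqa": "yes-no question answering",
    "coref": "coreference resolution",
    "paraphrase": "paraphrase generation",
    "paraphrase_id": "paraphrase identification",
    "sent_comp": "sentence completion",
    "sentiment": "sentiment",
    "summarization": "summarization",
    "text_gen": "text generation",
    "topic_class": "topic classification",
    "wsd": "word sense disambiguation",
    "te": "textual entailment",
    "nli": "natural language inference",
}


def resolve_task_type(task_type: str) -> str:
    """Return the full form for a short-form or full-form task type."""
    key = task_type.strip().lower()
    if key in SHORT_TO_FULL:
        return SHORT_TO_FULL[key]
    if key in SHORT_TO_FULL.values():
        return key
    raise ValueError(
        f"Unknown task type {task_type!r}; expected one of {sorted(SHORT_TO_FULL)}"
    )


print(resolve_task_type("ynqa"))
```

Failing early with the list of valid short forms is cheaper than discovering a typo only after the model has been loaded onto the GPU.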





