2. Fine-tuning a Masked Language Model


  1. First, pick a suitable pretrained model for masked language modeling, such as the "bert-base-cased" checkpoint used earlier:

    
    
    from transformers import AutoModelForMaskedLM
    
    model_checkpoint = "bert-base-cased"
    model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
    
    num_parameters = model.num_parameters() / 1_000_000
    print(f"BERT_Base number of parameters: {round(num_parameters)}M")
    # BERT_Base number of parameters: 108M

    Now let's see how BERT_Base fills in a masked word:

    
    
    from transformers import AutoTokenizer
    import torch
    
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
    
    text = "WuHan City a great [MASK]."
    inputs = tokenizer(text, return_tensors="pt")
    print(inputs)
    # {
    #   'input_ids': tensor([[ 101, 8769, 3048, 1389, 1392,  170, 1632,  103,  119,  102]]),
    #   'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
    #   'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
    # }
    
    token_logits = model(**inputs).logits    # shape: [1, 10, 28996]
    
    ##************ locate the [MASK] token and extract its logits ***********
    mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
    print(mask_token_index)    # tokenizer.mask_token_id = 103
    # tensor([7])
    mask_token_logits = token_logits[0, mask_token_index, :]    # shape: [1, 28996]
    
    ##************ return the top-k candidates for [MASK] ***********
    top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
    
    for token in top_5_tokens:
        print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")
    # '>>> WuHan City a great city.'
    # '>>> WuHan City a great place.'
    # '>>> WuHan City a great town.'
    # '>>> WuHan City a great village.'
    # '>>> WuHan City a great name.'

2.1 Dataset and Data Processing

  1. Load the dataset: we fine-tune BERT_Base on the Large Movie Review Dataset (IMDb). The dataset has three splits: train, test, and unsupervised.

    
    
    from datasets import load_dataset
    
    imdb_dataset = load_dataset("imdb")
    print(imdb_dataset)
    # DatasetDict({
    #     train: Dataset({
    #         features: ['text', 'label'],
    #         num_rows: 25000
    #     })
    #     test: Dataset({
    #         features: ['text', 'label'],
    #         num_rows: 25000
    #     })
    #     unsupervised: Dataset({
    #         features: ['text', 'label'],
    #         num_rows: 50000
    #     })
    # })
  2. Data processing: for both autoregressive language modeling and masked language modeling, a common preprocessing step is to concatenate all samples and then split the whole corpus into equal-sized blocks. We also need to keep the word-id sequence so it can be used later for whole word masking.

    
    
    result = tokenizer("Welcome to WuHan City", is_split_into_words=False)    # run tokenization
    print(result)
    # {
    #   'input_ids': [101, 12050, 1106, 8769, 3048, 1389, 1392, 102],
    #   'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
    #   'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]
    # }
    print(result.word_ids())    # the word id of each token
    # [None, 0, 1, 2, 2, 2, 3, None]

    In addition, we drop the text and label fields, since they are no longer needed. We write a function that does all of this:

    
    
    def tokenize_function(examples):    # examples is a batch of samples
        result = tokenizer(examples["text"])    # result holds the batched outputs
        if tokenizer.is_fast:
            # result.word_ids(i) returns the word-id sequence of the i-th sample
            result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
        return result
    
    # use batched=True
    tokenized_datasets = imdb_dataset.map(
        tokenize_function, batched=True, remove_columns=["text", "label"]
    )
    print(tokenized_datasets)    # the word_ids column is the one we added manually
    # DatasetDict({
    #     train: Dataset({
    #         features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'],
    #         num_rows: 25000
    #     })
    #     test: Dataset({
    #         features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'],
    #         num_rows: 25000
    #     })
    #     unsupervised: Dataset({
    #         features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'],
    #         num_rows: 50000
    #     })
    # })

    Tokenization is now done. The next step is to concatenate the sequences and split them into blocks. How should the block size be chosen? It depends on how much GPU memory is available. It is also worth checking the model's maximum context length, which is exposed through the tokenizer.model_max_length attribute:

    
    
    print(tokenizer.model_max_length)
    # 512

    We then concatenate the texts and split them into chunks of size block_size. The code below uses a chunk size of 128, well below the 512 maximum, so that the example fits comfortably in modest GPU memory.

    
    
    def group_texts(examples, chunk_size=128):
        # keys() is ('input_ids', 'token_type_ids', 'attention_mask', 'word_ids')
        concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}    # concatenate the samples
        total_length = len(concatenated_examples[list(examples.keys())[0]])    # total number of tokens
        # drop the last chunk if it is smaller than chunk_size (alternatively, pad it up to chunk_size)
        total_length = (total_length // chunk_size) * chunk_size
        result = {
            k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]    # split into chunks
            for k, t in concatenated_examples.items()
        }
        # the labels are simply the input tokens, because the MLM labels are the masked-out tokens themselves
        result["labels"] = result["input_ids"].copy()
        return result
    
    lm_datasets = tokenized_datasets.map(group_texts, batched=True)
    print(lm_datasets)
    # DatasetDict({
    #     train: Dataset({
    #         features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
    #         num_rows: 63037
    #     })
    #     test: Dataset({
    #         features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
    #         num_rows: 61623
    #     })
    #     unsupervised: Dataset({
    #         features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
    #         num_rows: 126497
    #     })
    # })

    Looking at the training split, there are now more samples than the original 25k, because each sample is a block of contiguous tokens rather than one of the original sentiment-classification reviews.
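
    To sanity-check the chunking, you can decode one block back into text (a minimal sketch that reuses the lm_datasets object built above; the index 1 is arbitrary):

    sample_chunk = lm_datasets["train"][1]
    print(len(sample_chunk["input_ids"]))    # 128, i.e. the chunk size
    print(tokenizer.decode(sample_chunk["input_ids"]))    # a 128-token slice that may span two reviews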

    One key step is still missing: inserting [MASK] tokens at random positions in the input. This has to happen dynamically during training, rather than being prepared statically ahead of time.

2.2 Fine-tuning the Model with the Trainer API

  1. As noted above, we need to insert [MASK] tokens at random positions of the input during training. This requires a special data collator that can randomly mask some tokens of the input text on the fly, namely DataCollatorForLanguageModeling. We pass it the mlm_probability argument to specify the fraction of tokens to mask; we pick 15%, the value commonly used in the papers:

    
    
    from transformers import DataCollatorForLanguageModeling
    
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
    
    samples = [lm_datasets["train"][i] for i in range(2)]
    for sample in samples:
        _ = sample.pop("word_ids")    # drop word_ids, otherwise data_collator(samples) raises an exception
    
    for chunk in data_collator(samples)["input_ids"]:
        print(f"\n'>>> {tokenizer.decode(chunk)}'")
    # >>> [CLS] I rented I AM [MASK] deliberate [MASK]US - YEL [MASK]OW from my video store because of all thedating that surrounded it when it [MASK] first released in 1967. I also heard [MASK] [MASK] first it was seized by U. S. [MASK] if it ever tried to enter this [MASK], [MASK] being a fan of films [MASK] " controversial " I really had to see this for myself. < br / [MASK] < br / > The plot [MASK] centered around a young Swedish drama student named Lena who neighbouring to learn everything she can about [MASK]. In particular [MASK] wants [MASK] focus her attention [MASK] to making some sort of documentary on [MASK] the average Swed
    # >>> ##e thought about [MASK] political [MASK] such as [MASK] Vietnam War and [MASK] issues in [MASK] [MASK] States. In between asking politicians [MASK] ordinary [MASK] [MASK]mony of Stockholm about their opinions on politics, she [MASK] [MASK] with [MASK] drama [MASK] [MASK] classmates, [MASK] married men. [MASK] br / Quaker < br / > What kills me about I AM CURIOUS - YELLOW is that 40 years ago, this was considered pornographic. Really [MASK] [MASK] sex and nudi [MASK] scenes are few and far between [MASK] even then it's not shot like some cheaply made [MASK]orno [MASK] While my countrymen mind find it shocking, in [MASK]

    A side effect of random masking is that, when using the Trainer, our evaluation metric can be noisy (each evaluation may give a slightly different result), because the test set also goes through the same DataCollatorForLanguageModeling. However, we can use Accelerate to customize the training loop (instead of relying on the one wrapped by the Trainer) and freeze this source of randomness, as shown in Section 2.3.

  2. Whole word masking (WWM): whole word masking masks entire words rather than individual tokens within a word. To use whole word masking, we have to build a data collator ourselves. This is where the word_ids computed earlier come in: they give the word id of every token. Note that all labels are set to -100 except the ones corresponding to [MASK] positions:

    
    
    import collections
    import numpy as np
    from transformers import default_data_collator
    
    wwm_probability = 0.2
    
    def whole_word_masking_data_collator(features):
        for feature in features:
            word_ids = feature.pop("word_ids")
    
            mapping = collections.defaultdict(list)    # maps a word index to the list of its token positions
            current_word_index = -1
            current_word = None
            for idx, word_id in enumerate(word_ids):
                if word_id is not None:
                    if word_id != current_word:
                        current_word = word_id
                        current_word_index += 1
                    mapping[current_word_index].append(idx)
    
            # randomly mask whole words
            mask = np.random.binomial(1, wwm_probability, (len(mapping),))    # note the single-element shape tuple (xxx,)
            input_ids = feature["input_ids"]
            labels = feature["labels"]
            new_labels = [-100] * len(labels)    # default everything to -100
            for word_id in np.where(mask)[0]:    # np.where(mask) returns a tuple
                word_id = word_id.item()
                for idx in mapping[word_id]:    # token positions of the masked word
                    new_labels[idx] = labels[idx]
                    input_ids[idx] = tokenizer.mask_token_id
            feature["labels"] = new_labels
    
        return default_data_collator(features)
    
    samples = [lm_datasets["train"][i] for i in range(2)]
    batch = whole_word_masking_data_collator(samples)
    
    for chunk in batch["input_ids"]:
        print(f"\n'>>> {tokenizer.decode(chunk)}'")
    # >>> [CLS] I rented I AM [MASK] [MASK] [MASK] [MASK] - YELLOW from my [MASK] store because of all the controversy that [MASK] it when it was first released [MASK] [MASK]. [MASK] also heard that at first it was [MASK] [MASK] U [MASK] S [MASK] [MASK] if it [MASK] tried to enter this country, therefore [MASK] a fan [MASK] [MASK] [MASK] " [MASK] " I really had [MASK] see this for myself [MASK] [MASK] br / [MASK] < br / > The plot [MASK] [MASK] around a young [MASK] drama student [MASK] Lena who wants to learn everything she [MASK] [MASK] life. In particular she wants to [MASK] her attentions to making some sort of documentary [MASK] what [MASK] average Swed
    # >>> ##e thought about certain political issues such as the Vietnam War and race issues in [MASK] United States [MASK] In between asking politicians and ordinary denizens of [MASK] about their opinions on politics, [MASK] has [MASK] with [MASK] [MASK] teacher [MASK] [MASK], and married men. [MASK] [MASK] / [MASK] < br [MASK] > What kills me about I [MASK] CURIOUS - [MASK] [MASK] [MASK] [MASK] is that 40 years ago, this was considered pornographic. Really, the sex [MASK] nudity scenes are few and [MASK] [MASK] [MASK] even then it's [MASK] [MASK] like some cheaply made [MASK] [MASK] [MASK]. [MASK] my countrymen mind find [MASK] [MASK], in [MASK]
  3. Downsampling the dataset: to keep the demo fast, we shrink the training set to a few thousand samples.

    
    
    train_size = 10_000
    test_size = int(0.1 * train_size)
    
    downsampled_dataset = lm_datasets["train"].train_test_split(
        train_size=train_size, test_size=test_size, seed=42
    )
    print(downsampled_dataset)
    # DatasetDict({
    #     train: Dataset({
    #         features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
    #         num_rows: 10000
    #     })
    #     test: Dataset({
    #         features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
    #         num_rows: 1000
    #     })
    # })
  4. Configuring the Trainer: next we can log in to the Hugging Face Hub (optional; run huggingface-cli login on the command line).

    
    
    from transformers import TrainingArguments
    
    batch_size = 64
    logging_steps = len(downsampled_dataset["train"]) // batch_size    # log the training loss once per epoch
    model_name = model_checkpoint.split("/")[-1]
    
    training_args = TrainingArguments(
        output_dir=f"{model_name}-finetuned-imdb",
        overwrite_output_dir=True,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        learning_rate=2e-5,
        weight_decay=0.01,
        num_train_epochs=3,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        push_to_hub=False,              # do not push to the Hub for now
        fp16=True,                      # mixed-precision training to speed things up
        logging_steps=logging_steps,    # set logging_steps
        remove_unused_columns=False,    # needed for the WWM data_collator
    )
    
    from transformers import Trainer
    
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=downsampled_dataset["train"],
        eval_dataset=downsampled_dataset["test"],
        # data_collator=data_collator,
        data_collator=whole_word_masking_data_collator,    # WWM data_collator
    )

    By default, the Trainer removes any column that is not an argument of the model's forward() method. This means that if you use the WWM data_collator, you also need to set remove_unused_columns=False so that the word_ids column is not dropped during training.
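
    If you want to see which columns would survive, you can inspect the model's forward() signature, which is essentially what the Trainer looks at when deciding which columns to drop (a small illustrative sketch using the standard-library inspect module; the printed keys are indicative):

    import inspect
    
    # any dataset column whose name is not among these parameters is removed
    # by the Trainer unless remove_unused_columns=False is set
    print(inspect.signature(model.forward).parameters.keys())
    # odict_keys(['input_ids', 'attention_mask', 'token_type_ids', 'position_ids', ...])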

  5. Perplexity of the language model: a good language model assigns high probability to grammatically correct sentences and low probability to nonsensical ones. Perplexity measures this. It has several mathematical definitions; here we use the exponential of the cross-entropy loss, i.e. perplexity = exp(cross-entropy). We can therefore compute the cross-entropy loss on the test set with Trainer.evaluate() and exponentiate the result to get the pretrained model's perplexity:

    
    
    import math
    
    eval_results = trainer.evaluate()
    print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
    # >>> Perplexity: 39.75
    
    trainer.train()
    
    eval_results = trainer.evaluate()
    print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
    # >>> Perplexity: 22.11
    
    # trainer.push_to_hub()    # optional: push to the Hub (requires a prior huggingface-cli login)
    tokenizer.save_pretrained(training_args.output_dir)    # save the tokenizer alongside the model

    A lower perplexity score means a better language model. The model's perplexity dropped considerably, which suggests it has learned something about the movie-review domain.
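
    As a quick sanity check of the perplexity = exp(loss) relationship, the reported perplexities above correspond to evaluation losses of roughly ln(39.75) ≈ 3.68 before fine-tuning and ln(22.11) ≈ 3.10 afterwards (a back-of-the-envelope illustration derived from the numbers above, not additional measured output):

    import math
    
    print(math.log(39.75))    # ≈ 3.683, eval loss before fine-tuning
    print(math.log(22.11))    # ≈ 3.096, eval loss after fine-tuning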

  6. Using the model: the fine-tuned model can now be called through the Transformers pipeline:

    
    
    from transformers import pipeline
    
    mask_filler = pipeline(    # note: the tokenizer also needs to be saved in this directory
        "fill-mask", model="./bert-base-cased-finetuned-imdb/checkpoint-471"
    )
    
    preds = mask_filler("WuHan City a great [MASK].")
    for pred in preds:
        print(f">>> {pred['sequence']}")
    # >>> WuHan City a great city.
    # >>> WuHan City a great place.
    # >>> WuHan City a great town.
    # >>> WuHan City a great one.
    # >>> WuHan City a great name.

2.3 Customizing the Training Loop

  1. DataCollatorForLanguageModeling applies random masking on every evaluation pass, so we see some fluctuation in the perplexity score from one training run to the next. One way to eliminate this source of randomness is to apply the masking once to the entire test set and then use the default data collator from Transformers.

    Note that the custom training loop creates the dataloaders by hand, whereas the Trainer API only needs the datasets and builds the dataloaders itself.

    The complete code is shown below:

    
    
    from transformers import AutoModelForMaskedLM
    from transformers import AutoTokenizer
    from tqdm.auto import tqdm
    import torch
    import math
    from datasets import load_dataset
    from transformers import DataCollatorForLanguageModeling
    import collections
    import numpy as np
    from transformers import default_data_collator
    from torch.utils.data import DataLoader
    from torch.optim import AdamW
    from accelerate import Accelerator
    from transformers import get_scheduler
    
    ##********** load the pre-trained model, the tokenizer, and the dataset ********
    model_checkpoint = "bert-base-cased"
    model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
    imdb_dataset = load_dataset("imdb")
    
    ##********* tokenization ****************
    def tokenize_function(examples):    # examples is a batch of samples
        result = tokenizer(examples["text"])    # result holds the batched outputs
        if tokenizer.is_fast:
            # result.word_ids(i) returns the word-id sequence of the i-th sample
            result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
        return result
    
    tokenized_datasets = imdb_dataset.map(
        tokenize_function, batched=True, remove_columns=["text", "label"]
    )
    
    ##********* chunking ****************
    def group_texts(examples, chunk_size=128):
        # keys() is ('input_ids', 'token_type_ids', 'attention_mask', 'word_ids')
        concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}    # concatenate the samples
        total_length = len(concatenated_examples[list(examples.keys())[0]])    # total number of tokens
        # drop the last chunk if it is smaller than chunk_size (alternatively, pad it up to chunk_size)
        total_length = (total_length // chunk_size) * chunk_size
        result = {
            k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]    # split into chunks
            for k, t in concatenated_examples.items()
        }
        # the labels are simply the input tokens, because the MLM labels are the masked-out tokens themselves
        result["labels"] = result["input_ids"].copy()
        return result
    
    lm_datasets = tokenized_datasets.map(group_texts, batched=True)
    
    train_size = 10_000
    test_size = int(0.1 * train_size)
    downsampled_dataset = lm_datasets["train"].train_test_split(    # demo: shrink the dataset
        train_size=train_size, test_size=test_size, seed=42
    )
    
    ##********** create the WWM data collator ***********
    # data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
    wwm_probability = 0.2
    
    def whole_word_masking_data_collator(features):
        for feature in features:
            word_ids = feature.pop("word_ids")
    
            mapping = collections.defaultdict(list)    # maps a word index to the list of its token positions
            current_word_index = -1
            current_word = None
            for idx, word_id in enumerate(word_ids):
                if word_id is not None:
                    if word_id != current_word:
                        current_word = word_id
                        current_word_index += 1
                    mapping[current_word_index].append(idx)
    
            # randomly mask whole words
            mask = np.random.binomial(1, wwm_probability, (len(mapping),))    # note the single-element shape tuple (xxx,)
            input_ids = feature["input_ids"]
            labels = feature["labels"]
            new_labels = [-100] * len(labels)    # default everything to -100
            for word_id in np.where(mask)[0]:    # np.where(mask) returns a tuple
                word_id = word_id.item()
                for idx in mapping[word_id]:    # token positions of the masked word
                    new_labels[idx] = labels[idx]
                    input_ids[idx] = tokenizer.mask_token_id
            feature["labels"] = new_labels
    
        return default_data_collator(features)
    
    ##************ apply static masking to the test set ***************
    def insert_random_mask(batch):
        features = [dict(zip(batch, t)) for t in zip(*batch.values())]
        masked_inputs = whole_word_masking_data_collator(features)
        # for every column of the dataset, create a corresponding masked column
        return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}
    
    # downsampled_dataset = downsampled_dataset.remove_columns(["word_ids"])    # keep this commented out when using whole_word_masking_data_collator
    eval_dataset = downsampled_dataset["test"].map(
        insert_random_mask,
        batched=True,
        remove_columns=downsampled_dataset["test"].column_names,
    )
    eval_dataset = eval_dataset.rename_columns(
        {
            "masked_input_ids": "input_ids",
            "masked_attention_mask": "attention_mask",
            "masked_labels": "labels",
        }
    ).remove_columns(["masked_token_type_ids"])    # drop columns we do not need
    
    ##************ create the data loaders ***************
    batch_size = 64
    train_dataloader = DataLoader(
        downsampled_dataset["train"],
        shuffle=True,
        batch_size=batch_size,
        collate_fn=whole_word_masking_data_collator,
    )
    eval_dataloader = DataLoader(
        eval_dataset, batch_size=batch_size, collate_fn=default_data_collator
    )
    
    ##************ create the training components ***************
    optimizer = AdamW(model.parameters(), lr=5e-5)
    
    accelerator = Accelerator()
    model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader
    )
    
    num_train_epochs = 3
    num_update_steps_per_epoch = len(train_dataloader)
    num_training_steps = num_train_epochs * num_update_steps_per_epoch
    
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps,
    )
    
    ##********** create a model repository on the Hugging Face Hub (optional, can be skipped) **********
    # from huggingface_hub import get_full_repo_name
    model_name = "%s-finetuned-imdb-accelerate" % model_checkpoint
    # repo_name = get_full_repo_name(model_name)
    # from huggingface_hub import Repository
    output_dir = model_name
    # repo = Repository(output_dir, clone_from=repo_name)
    
    ##************ training and evaluation ****************
    progress_bar = tqdm(range(num_training_steps))
    
    for epoch in range(num_train_epochs):
        # Training
        model.train()
        for batch in train_dataloader:
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)
    
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)
    
        # Evaluation
        model.eval()
        losses = []
        for step, batch in enumerate(eval_dataloader):
            with torch.no_grad():
                outputs = model(**batch)
    
            loss = outputs.loss
            # replicate the per-batch loss so there is one value per sample, then gather across processes
            losses.append(accelerator.gather(loss.repeat(batch_size)))
    
        losses = torch.cat(losses)
        losses = losses[: len(eval_dataset)]    # keep one loss value per evaluation sample
    
        try:
            perplexity = math.exp(torch.mean(losses))
        except OverflowError:
            perplexity = float("inf")
    
        print(f">>> Epoch {epoch}: Perplexity: {perplexity}")
        # >>> Epoch 0: Perplexity: 22.54525292335159
        # >>> Epoch 1: Perplexity: 21.186613045279536
        # >>> Epoch 2: Perplexity: 20.757056615284373
    
    ##*********** save and upload the fine-tuned model ****************
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)    # save with accelerator.save
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        # repo.push_to_hub(
        #     commit_message=f"Training in progress epoch {epoch}", blocking=False
        # )
