
Published 2023-07-17 23:38:23

  1. Text summarization compresses a long article into a short summary. This requires understanding the article's content and generating coherent text that captures the document's topic.

4.1 Dataset and Data Processing

  1. Loading the dataset: we use the Multilingual Amazon Reviews Corpus to build a bilingual summarizer. The corpus consists of Amazon product reviews in six languages and is typically used to benchmark multilingual classifiers. However, since each review comes with a short title, we can use the titles as the target summaries for our model to learn from.

    • First, download the dataset; here we download the English and Spanish subsets:

      
      
      from datasets import load_dataset

      spanish_dataset = load_dataset("amazon_reviews_multi", "es")
      english_dataset = load_dataset("amazon_reviews_multi", "en")
      print(english_dataset)
      # DatasetDict({
      #     train: Dataset({
      #         features: ['review_id', 'product_id', 'reviewer_id',
      #             'stars', 'review_body', 'review_title', 'language', 'product_category'],
      #         num_rows: 200000
      #     })
      #     validation: Dataset({
      #         features: ['review_id', 'product_id', 'reviewer_id',
      #             'stars', 'review_body', 'review_title', 'language', 'product_category'],
      #         num_rows: 5000
      #     })
      #     test: Dataset({
      #         features: ['review_id', 'product_id', 'reviewer_id',
      #             'stars', 'review_body', 'review_title', 'language', 'product_category'],
      #         num_rows: 5000
      #     })
      # })

      As we can see, for each language the train split has 200k reviews, and the validation and test splits have 5k reviews each. The review information we are interested in is in the review_body and review_title fields.

      We can write a simple function to look at a few samples:

      
      
      def show_samples(dataset, num_samples=3, seed=42):
          sample = dataset["train"].shuffle(seed=seed).select(range(num_samples))
          for example in sample:
              print(f"\n'>> Title: {example['review_title']}'")
              print(f"'>> Review: {example['review_body']}'")

      show_samples(english_dataset)
      # '>> Title: Worked in front position, not rear'
      # '>> Review: 3 stars because these are not rear brakes as stated in the item description. At least the mount adapter only worked on the front fork of the bike that I got it for.'
      # ....
    • Next, we filter the samples. Training a summarization model on all 400k reviews (two languages, with 200k training reviews each) would take far too long on a single GPU, so we restrict ourselves to reviews in the book categories (including e-books).

      
      
      def filter_books(example):
          return (
              example["product_category"] == "book"
              or example["product_category"] == "digital_ebook_purchase"
          )

      spanish_books = spanish_dataset.filter(filter_books)
      english_books = english_dataset.filter(filter_books)
      show_samples(english_books)
      # '>> Title: I'm dissapointed.'
      # '>> Review: I guess I had higher expectations for this book from the reviews. I really thought I'd at least like it. The plot idea was great. I loved Ash but, it just didnt go anywhere. Most of the book was about their radio show and talking to callers. I wanted the author to dig deeper so we could really get to know the characters. All we know about Grace is that she is attractive looking, Latino and is kind of a brat. I'm dissapointed.'
      # ....
    • Then we merge the English and Spanish reviews into a single DatasetDict object:

      
      
      from datasets import concatenate_datasets, DatasetDict

      books_dataset = DatasetDict()
      for split in english_books.keys():
          books_dataset[split] = concatenate_datasets(
              [english_books[split], spanish_books[split]])
          books_dataset[split] = books_dataset[split].shuffle(seed=42)

      show_samples(books_dataset)
      # '>> Title: Easy to follow!!!!'
      # '>> Review: I loved The dash diet weight loss Solution. Never hungry. I would recommend this diet. Also the menus are well rounded. Try it. Has lots of the information need thanks.'
      # ....

      Now every train/validation/test split is a mix of English and Spanish reviews.

    • Next, we filter out titles that are too short. If the reference summaries (here, the titles) are too short, the model becomes biased toward generating summaries of only one or two words.

      
      
      books_dataset = books_dataset.filter(lambda x: len(x["review_title"].split()) > 2)
  2. Preprocessing the data: we now need to tokenize and encode the reviews and their titles.

    • First, load the tokenizer. Here we use the mt5-base model.

      
      
      from transformers import AutoTokenizer

      model_checkpoint = "/mnt/disk_b/ModelZoo/mt5-base"  # downloaded locally in advance
      tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
      inputs = tokenizer("I really enjoy reading!")
      print(inputs)
      # {'input_ids': [336, 259, 4940, 9070, 11807, 309, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
      print(tokenizer.convert_ids_to_tokens(inputs.input_ids))
      # ['▁I', '▁', 'really', '▁enjoy', '▁reading', '!', '</s>']

      The special Unicode character ▁ and the end-of-sequence token </s> indicate that we are dealing with a SentencePiece tokenizer. SentencePiece is based on the Unigram tokenization algorithm, which is especially useful for multilingual corpora because it lets SentencePiece tokenize text without knowing about accents or punctuation, and in languages without whitespace between characters (such as Chinese).

    • To tokenize the text we must handle a detail specific to summarization: the labels are also text, so they too may exceed the model's maximum context size. This means we need to truncate both the reviews and the titles, ensuring we don't pass overly long inputs to the model. Tokenizers in Transformers provide an as_target_tokenizer() function that lets you tokenize the labels in parallel with the inputs:

      
      
      max_input_length = 512  # upper bound on review length
      max_target_length = 30  # upper bound on title length

      def preprocess_function(examples):
          model_inputs = tokenizer(
              examples["review_body"], max_length=max_input_length, truncation=True
          )
          # Set up the tokenizer for targets
          with tokenizer.as_target_tokenizer():
              labels = tokenizer(
                  examples["review_title"], max_length=max_target_length, truncation=True
              )
          model_inputs["labels"] = labels["input_ids"]
          return model_inputs

      tokenized_datasets = books_dataset.map(preprocess_function, batched=True)

4.2 Evaluation

  1. Evaluation metric: measuring the performance of text generation tasks (such as summarization or translation) is not straightforward. One of the most commonly used metrics is the ROUGE score (Recall-Oriented Understudy for Gisting Evaluation). The basic idea behind this metric is to compare a generated summary against a set of reference summaries (typically created by humans). Concretely, suppose we compare the following two summaries:

    
    
    generated_summary = "I absolutely loved reading the Hunger Games"
    reference_summary = "I loved reading the Hunger Games"

    One way to compare them is to count the number of overlapping words, which in this case is 6. However, this is a bit crude, so ROUGE instead computes precision and recall over the overlapping words.

    • recall: measures how much of the reference summary is captured by the generated summary. If we just compare words, recall is:

      (1) Recall = (number of overlapping words) / (total number of words in the reference summary)
    • precision: measures how much of the generated summary is relevant to the reference summary. If we just compare words, precision is:

      (2) Precision = (number of overlapping words) / (total number of words in the generated summary)

    In practice we usually compute both precision and recall and then report the F1-score. We can install the rouge_score package and call the metric from datasets.
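As a sanity check, formulas (1) and (2) above, together with the F1-score computed from them, can be evaluated by hand for the example pair. This is a minimal whitespace-tokenized sketch with a hypothetical helper name, not what the library does internally:

```python
from collections import Counter

def unigram_scores(generated, reference):
    """Word-overlap precision/recall/F1 (hypothetical helper for illustration)."""
    gen, ref = generated.split(), reference.split()
    # multiset intersection counts each overlapping word occurrence
    overlap = sum((Counter(gen) & Counter(ref)).values())
    precision = overlap / len(gen)   # formula (2)
    recall = overlap / len(ref)      # formula (1)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = unigram_scores(
    "I absolutely loved reading the Hunger Games",
    "I loved reading the Hunger Games",
)
print(p, r, f1)  # ≈0.857, 1.0, ≈0.923
```

These numbers match the rouge1 precision/recall/fmeasure that the library reports for the same pair.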

    
    
    # pip install rouge_score  (install first)
    from datasets import load_metric

    rouge_score = load_metric("rouge")
    scores = rouge_score.compute(
        predictions=[generated_summary], references=[reference_summary]
    )
    print(scores)
    # {
    #  'rouge1': AggregateScore(
    #     low=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923),
    #     mid=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923),
    #     high=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923)),
    #  'rouge2': AggregateScore(
    #     low=Score(precision=0.6666666666666666, recall=0.8, fmeasure=0.7272727272727272),
    #     mid=Score(precision=0.6666666666666666, recall=0.8, fmeasure=0.7272727272727272),
    #     high=Score(precision=0.6666666666666666, recall=0.8, fmeasure=0.7272727272727272)),
    #  'rougeL': AggregateScore(
    #     low=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923),
    #     mid=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923),
    #     high=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923)),
    #  'rougeLsum': AggregateScore(
    #     low=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923),
    #     mid=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923),
    #     high=Score(precision=0.8571428571428571, recall=1.0, fmeasure=0.923076923076923))
    # }

    rouge_score.compute() computes all the metrics at once. The output is interpreted as follows:

    • First, rouge_score computes confidence intervals for precision, recall, and F1-score: these are the low/mid/high attributes.

    • Second, rouge_score compares the generated and reference summaries at different granularities: rouge1 is unigram granularity and rouge2 is bigram granularity. rougeL and rougeLsum find the longest common subsequence between the generated and reference summaries to obtain overlapping word sequences. rougeLsum computes the metric over the entire summary, while rougeL is the average of the per-sentence metrics. Since the example above contains only one sentence, rougeLsum and rougeL produce the same result.
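To make the longest-common-subsequence idea behind rougeL concrete, here is a minimal sketch (hypothetical helper names and plain whitespace tokenization; the real rouge_score implementation adds stemming and bootstrap aggregation):

```python
def lcs_length(a, b):
    # classic dynamic-programming longest common subsequence
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(generated, reference):
    gen, ref = generated.split(), reference.split()
    lcs = lcs_length(gen, ref)
    precision, recall = lcs / len(gen), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("I absolutely loved reading the Hunger Games",
                 "I loved reading the Hunger Games"))
```

Here the longest common subsequence is the entire 6-word reference, so rougeL coincides with rouge1 (≈0.923) for this pair.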

  2. A strong baseline: a common baseline for text summarization is to simply take the first three sentences of an article, often called the lead-3 baseline. We could split sentences on periods ("."), but that fails on acronyms like "U.S." or "U.N.". Instead we will use the nltk library, which includes a better algorithm for handling these cases.
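The failure mode of naive period splitting is easy to demonstrate (hypothetical example sentence):

```python
text = "I met her in the U.S. last year. It was great. Really."
naive = text.split(". ")
print(naive)
# the abbreviation "U.S." wrongly ends a "sentence":
# ['I met her in the U.S', 'last year', 'It was great', 'Really.']
```

nltk's sent_tokenize handles such abbreviations and keeps "U.S." inside a single sentence.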

    
    
    # pip install nltk  (install nltk first)
    import nltk
    nltk.download("punkt")  # download the punctuation rules
    from nltk.tokenize import sent_tokenize  # sentence tokenizer

    def three_sentence_summary(text):
        return "\n".join(sent_tokenize(text)[:3])  # take the first three sentences

    print(three_sentence_summary(books_dataset["train"][1]["review_body"]))
    # I grew up reading Koontz, and years ago, I stopped,convinced i had "outgrown" him.
    # Still,when a friend was looking for something suspenseful too read, I suggested Koontz.
    # She found Strangers.

    Since the convention in summarization is to separate summaries with newlines, we join the first three sentences with "\n" here.

    Next we implement a function that extracts the lead-3 summaries from a dataset and computes the baseline's ROUGE scores:

    
    
    def evaluate_baseline(dataset, metric):
        summaries = [three_sentence_summary(text) for text in dataset["review_body"]]
        return metric.compute(predictions=summaries, references=dataset["review_title"])

    We can then use this function to compute the ROUGE scores on the validation set:

    
    
    score = evaluate_baseline(books_dataset["validation"], rouge_score)
    rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
    rouge_dict = dict((rn, round(score[rn].mid.fmeasure * 100, 2)) for rn in rouge_names)
    print(rouge_dict)
    # {'rouge1': 16.77, 'rouge2': 8.87, 'rougeL': 15.55, 'rougeLsum': 15.92}

    We can see that the rouge2 score is noticeably lower than the others. This likely reflects the fact that review titles are typically concise, so the lead-3 baseline is too verbose.

4.3 Fine-Tuning with the Trainer API

  1. First we load the pretrained model. Since summarization is a sequence-to-sequence task, we can load the model with the AutoModelForSeq2SeqLM class:

    
    
    from transformers import AutoModelForSeq2SeqLM

    model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, device_map='auto')
    # mt5-base takes about 2.5 GB of GPU memory

    For sequence-to-sequence tasks, AutoModelForSeq2SeqLM keeps all of the network's weights. In contrast, for text classification tasks the pretrained model's head is replaced by a randomly initialized network.

  2. Next we define the hyperparameters and other arguments. We use the dedicated Seq2SeqTrainingArguments and Seq2SeqTrainer classes.

    
    
    from transformers import Seq2SeqTrainingArguments

    batch_size = 8
    num_train_epochs = 2  # only 2 epochs for demonstration purposes
    logging_steps = len(tokenized_datasets["train"]) // batch_size  # log once per epoch
    model_name = model_checkpoint.split("/")[-1]
    args = Seq2SeqTrainingArguments(
        output_dir=f"{model_name}-finetuned-amazon-en-es",
        evaluation_strategy="epoch",
        learning_rate=5.6e-5,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        weight_decay=0.01,
        save_total_limit=1,  # keep at most 1 checkpoint, since each one is large (2.5 GB)
        num_train_epochs=num_train_epochs,
        predict_with_generate=True,  # generate summaries during evaluation to compute ROUGE
        logging_steps=logging_steps,
        push_to_hub=False,  # whether to push the trained model to the Hub
    )

    predict_with_generate=True tells the Seq2SeqTrainer to call the model's generate() method to produce summaries during evaluation.

  3. We then provide the Trainer with a compute_metrics() function so the model can be evaluated during training. This is slightly more involved here, because we need to decode the outputs and labels into text before passing them to rouge_score.compute() to calculate the ROUGE scores. In addition, we use the sent_tokenize() function from nltk to separate the summary sentences with newlines:

    
    
    import numpy as np

    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)  # decode the predictions
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)  # replace label == -100 with pad_token_id
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)  # decode the labels
        decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]  # sentence-split each sample
        decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label in decoded_labels]  # sentence-split each sample
        result = rouge_score.compute(  # compute the ROUGE scores
            predictions=decoded_preds, references=decoded_labels, use_stemmer=True
        )
        result = {key: value.mid.fmeasure * 100 for key, value in result.items()}  # keep only the mid scores
        return {k: round(v, 4) for k, v in result.items()}
  4. Next we need to define a data collator for the seq-to-seq task. During decoding, for mT5 we need to shift the labels one position to the right to serve as the decoder's input. Transformers provides a DataCollatorForSeq2Seq that dynamically pads both the inputs and the labels for us. To instantiate this collator, we just provide the tokenizer and the model:

    
    
    from transformers import DataCollatorForSeq2Seq

    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

    Let's see what this collator produces when fed a mini-batch of samples.

    First we remove all string-typed columns, because the collator doesn't know how to handle them:

    
    
    tokenized_datasets = tokenized_datasets.remove_columns(
        books_dataset["train"].column_names
    )
    # columns in tokenized_datasets: review_id, product_id, reviewer_id, stars, review_body, review_title, language, product_category, input_ids, attention_mask, labels
    # columns in books_dataset["train"]: review_id, product_id, reviewer_id, stars, review_body, review_title, language, product_category

    Since the collator expects a list of dicts, where each dict represents one sample in the dataset, we also need to wrangle the data into the expected format before passing it to the data collator:

    
    
    features = [tokenized_datasets["train"][i] for i in range(2)]
    print(data_collator(features))
    # {'input_ids': tensor([[...],
    #                       [...]]),
    #  'attention_mask': tensor([[1...],
    #                            [...]]),
    #  'labels': tensor([[  298,   259,  5994,   269,   774,  5547,     1],
    #                    [  298, 10380,   304, 13992,   291,     1,  -100]]),
    #  'decoder_input_ids': tensor([[    0,   298,   259,  5994,   269,   774,  5547],
    #                               [    0,   298, 10380,   304, 13992,   291,     1]])}

    If one sample is shorter than another, its input_ids/attention_mask are padded on the right with [PAD] tokens (token ID 0). Similarly, we can see that the labels have been padded with -100, which ensures the pad tokens are ignored by the loss function. Finally, we see a new decoder_input_ids, formed by shifting the labels to the right and inserting a [PAD] token at the start.
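The right shift that produces decoder_input_ids can be sketched in a few lines (illustrative only; the actual logic lives inside the model, and for mT5 the decoder start token happens to share ID 0 with [PAD]):

```python
def shift_right(labels, decoder_start_token_id=0, pad_token_id=0):
    # hypothetical helper: replace the -100 loss mask with the pad token,
    # then prepend the decoder start token and drop the last position
    cleaned = [tok if tok != -100 else pad_token_id for tok in labels]
    return [decoder_start_token_id] + cleaned[:-1]

labels = [298, 10380, 304, 13992, 291, 1, -100]
print(shift_right(labels))  # [0, 298, 10380, 304, 13992, 291, 1]
```

This reproduces the second row of the decoder_input_ids printed above.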

  5. Now we can instantiate the Trainer and start training:

    
    
    from transformers import Seq2SeqTrainer

    trainer = Seq2SeqTrainer(
        model,
        args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )
    trainer.train()     # train
    trainer.evaluate()  # evaluate
    # {'eval_loss': nan,
    #  'eval_rouge1': 4.0461,
    #  'eval_rouge2': 0.7318,
    #  'eval_rougeL': 3.9266,
    #  'eval_rougeLsum': 3.9468,
    #  'eval_runtime': 6.6003,
    #  'eval_samples_per_second': 36.059,
    #  'eval_steps_per_second': 4.545,
    #  'epoch': 2.0}

    # trainer.push_to_hub(commit_message="Training complete", tags="summarization")  # push to the Hugging Face Hub
  6. Using the fine-tuned model:

    
    
    from transformers import pipeline

    summarizer = pipeline("summarization", model = "./mt5-base-finetuned-amazon-en-es/checkpoint-2000")

    We can feed some samples from the test set (which the model has not seen) into the pipeline to get a feel for the quality of the generated summaries.

    We implement a simple function that displays the review, the title, and the generated summary together:

    
    
    def print_summary(idx):
        review = books_dataset["test"][idx]["review_body"]
        title = books_dataset["test"][idx]["review_title"]
        summary = summarizer(books_dataset["test"][idx]["review_body"])[0]["summary_text"]
        print(f"'>>> Review: {review}'")
        print(f"\n'>>> Title: {title}'")
        print(f"\n'>>> Summary: {summary}'")

    print_summary(100)

4.4 A Custom Training Loop

  1. Fine-tuning mT5 with Accelerate is very similar to fine-tuning a text classification model. The differences are that we need to explicitly generate summaries during training and define how to compute the ROUGE scores.

  2. Creating the dataloaders: the first thing we need to do is create a DataLoader for each split of the dataset. Since PyTorch dataloaders expect batches of tensors, we need to set the dataset format to torch:

    
    
    from datasets import load_dataset
    from datasets import concatenate_datasets, DatasetDict
    from transformers import AutoTokenizer
    from torch.utils.data import DataLoader
    from transformers import DataCollatorForSeq2Seq
    from transformers import AutoModelForSeq2SeqLM

    ##************* create the dataloaders *****************
    ##***** load the data
    spanish_dataset = load_dataset("amazon_reviews_multi", "es")
    english_dataset = load_dataset("amazon_reviews_multi", "en")

    def filter_books(example):
        return (
            example["product_category"] == "book"
            or example["product_category"] == "digital_ebook_purchase"
        )

    spanish_books = spanish_dataset.filter(filter_books)
    english_books = english_dataset.filter(filter_books)

    ##****** merge the datasets of the two languages
    books_dataset = DatasetDict()
    for split in english_books.keys():
        books_dataset[split] = concatenate_datasets(
            [english_books[split], spanish_books[split]])
        books_dataset[split] = books_dataset[split].shuffle(seed=42)

    ##****** filter out short titles
    books_dataset = books_dataset.filter(lambda x: len(x["review_title"].split()) > 2)

    ##****** tokenization
    model_checkpoint = "/mnt/disk_b/ModelZoo/mt5-base"  # downloaded locally in advance
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
    max_input_length = 512  # upper bound on review length
    max_target_length = 30  # upper bound on title length

    def preprocess_function(examples):
        model_inputs = tokenizer(
            examples["review_body"], max_length=max_input_length, truncation=True
        )
        # Set up the tokenizer for targets
        with tokenizer.as_target_tokenizer():
            labels = tokenizer(
                examples["review_title"], max_length=max_target_length, truncation=True
            )
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    tokenized_datasets = books_dataset.map(preprocess_function, batched=True)
    tokenized_datasets = tokenized_datasets.remove_columns(
        books_dataset["train"].column_names
    )

    ##****** create the model
    model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

    ##****** data_collator
    data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

    ##********* create the dataloaders
    tokenized_datasets.set_format("torch")
    batch_size = 8
    train_dataloader = DataLoader(
        tokenized_datasets["train"],
        shuffle=True,
        collate_fn=data_collator,
        batch_size=batch_size,
    )
    eval_dataloader = DataLoader(
        tokenized_datasets["validation"], collate_fn=data_collator, batch_size=batch_size
    )
  3. Creating the training components:

    
    
    ##******************** create the training components *************
    ##****** optimizer
    from torch.optim import AdamW

    optimizer = AdamW(model.parameters(), lr=2e-5)

    ##***** create the accelerator
    from accelerate import Accelerator

    accelerator = Accelerator()
    model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader
    )

    ##***** create the learning-rate scheduler
    from transformers import get_scheduler

    num_train_epochs = 10
    num_update_steps_per_epoch = len(train_dataloader)  # must be called after accelerator.prepare()
    num_training_steps = num_train_epochs * num_update_steps_per_epoch
    lr_scheduler = get_scheduler(
        "linear",
        optimizer=optimizer,
        num_warmup_steps=0,
        num_training_steps=num_training_steps,
    )

    ##***** create rouge_score
    from datasets import load_metric

    rouge_score = load_metric("rouge")
  4. Post-processing:

    • Split the generated summaries into sentences separated by "\n", which is the format ROUGE expects.

      
      
      import nltk

      def postprocess_text(preds, labels):
          preds = [pred.strip() for pred in preds]
          labels = [label.strip() for label in labels]
          # ROUGE expects a newline after each sentence
          preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
          labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]
          return preds, labels
    • Create a repository on the Hugging Face Hub to store the model. If you don't need to upload it, this step can be skipped.

      
      
      # from huggingface_hub import get_full_repo_name
      # model_name = "test-bert-finetuned-squad-accelerate"
      # repo_name = get_full_repo_name(model_name)
      # from huggingface_hub import Repository

      output_dir = "results-mt5-finetuned-squad-accelerate"
      # repo = Repository(output_dir, clone_from=repo_name)
  5. Start training (on an RTX 4090; the model is 2.5 GB and training uses about 22.7 GB of memory):

    
    
    from tqdm.auto import tqdm
    import torch
    import numpy as np

    progress_bar = tqdm(range(num_training_steps))

    for epoch in range(num_train_epochs):
        ## Training phase
        model.train()
        for step, batch in enumerate(train_dataloader):
            outputs = model(**batch)
            loss = outputs.loss
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            progress_bar.update(1)

        # Evaluation phase
        model.eval()
        for step, batch in enumerate(eval_dataloader):
            with torch.no_grad():
                generated_tokens = accelerator.unwrap_model(model).generate(
                    batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                )
                ## pad the generated text
                ## in a multi-process setting, two processes may pad predictions/labels
                ## consistently within each process but inconsistently across processes,
                ## so we need to pad across processes here
                generated_tokens = accelerator.pad_across_processes(
                    generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
                )
                labels = batch["labels"]
                # if preprocessing did not pad to the maximum length,
                # the labels also need to be padded here
                labels = accelerator.pad_across_processes(
                    batch["labels"], dim=1, pad_index=tokenizer.pad_token_id
                )
                generated_tokens = accelerator.gather(generated_tokens).cpu().numpy()
                labels = accelerator.gather(labels).cpu().numpy()
                # Replace -100 in the labels as we can't decode them
                labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
                if isinstance(generated_tokens, tuple):
                    generated_tokens = generated_tokens[0]
                decoded_preds = tokenizer.batch_decode(
                    generated_tokens, skip_special_tokens=True
                )
                decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
                decoded_preds, decoded_labels = postprocess_text(
                    decoded_preds, decoded_labels
                )
                rouge_score.add_batch(predictions=decoded_preds, references=decoded_labels)

        # compute the metrics
        result = rouge_score.compute()
        # compute the mid ROUGE scores
        result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
        result = {k: round(v, 4) for k, v in result.items()}
        print(f"Epoch {epoch}:", result)

        # save the model at the end of each epoch
        accelerator.wait_for_everyone()
        unwrapped_model = accelerator.unwrap_model(model)
        unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
        if accelerator.is_main_process:
            tokenizer.save_pretrained(output_dir)
            # repo.push_to_hub(
            #     commit_message=f"Training in progress epoch {epoch}", blocking=False
            # )

    # Epoch 0: {'rouge1': 5.5492, 'rouge2': 0.6587, 'rougeL': 5.5844, 'rougeLsum': 5.5422}
    # Epoch 1: {'rouge1': 8.154, 'rouge2': 2.5786, 'rougeL': 8.0205, 'rougeLsum': 7.9891}
    # Epoch 2: {'rouge1': 13.8772, 'rouge2': 5.9258, 'rougeL': 13.86, 'rougeLsum': 13.858}
    # Epoch 3: {'rouge1': 14.3815, 'rouge2': 6.0753, 'rougeL': 14.1405, 'rougeLsum': 14.2002}
    # Epoch 4: {'rouge1': 12.9502, 'rouge2': 5.3787, 'rougeL': 12.8429, 'rougeLsum': 12.8553}
    # Epoch 5: {'rouge1': 13.613, 'rouge2': 6.2498, 'rougeL': 13.3715, 'rougeLsum': 13.3895}
    # Epoch 6: {'rouge1': 13.3266, 'rouge2': 6.0245, 'rougeL': 13.0357, 'rougeLsum': 13.0793}
    # Epoch 7: {'rouge1': 13.8225, 'rouge2': 6.4, 'rougeL': 13.5457, 'rougeLsum': 13.6644}
    # Epoch 8: {'rouge1': 13.9203, 'rouge2': 6.5123, 'rougeL': 13.6504, 'rougeLsum': 13.6976}
    # Epoch 9: {'rouge1': 14.374, 'rouge2': 6.9012, 'rougeL': 14.1307, 'rougeLsum': 14.2309}
  6. Usage:

    
    
    from transformers import pipeline

    summarizer = pipeline("summarization", model = "./mt5-base-finetuned-amazon-en-es/checkpoint-2000")

    def print_summary(idx):
        review = books_dataset["test"][idx]["review_body"]
        title = books_dataset["test"][idx]["review_title"]
        summary = summarizer(books_dataset["test"][idx]["review_body"])[0]["summary_text"]
        print(f"'>>> Review: {review}'")
        print(f"\n'>>> Title: {title}'")
        print(f"\n'>>> Summary: {summary}'")

    print_summary(-100)
    # '>>> Review: The story was all over the place. I felt no connection to Evelyn/Eva/Evie, and the ending was anticlimactic. Thank goodness it was a free book.'
    # '>>> Title: Neither gripping or emotional.'
    # '>>> Summary: Good book.'
