数学基础
- 线性代数
- 概率论与随机过程
- 数值计算
- 蒙特卡洛方法与 MCMC 采样
- 机器学习方法概论
统计学习
深度学习
- 深度学习简介
- 深度前馈网络
- 反向传播算法
- 正则化
- 深度学习中的最优化问题
- 卷积神经网络
- CNN:图像分类
- 循环神经网络 RNN
- Transformer
- 一、Transformer [2017]
- 二、Universal Transformer [2018]
- 三、Transformer-XL [2019]
- 四、GPT1 [2018]
- 五、GPT2 [2019]
- 六、GPT3 [2020]
- 七、OPT [2022]
- 八、BERT [2018]
- 九、XLNet [2019]
- 十、RoBERTa [2019]
- 十一、ERNIE 1.0 [2019]
- 十二、ERNIE 2.0 [2019]
- 十三、ERNIE 3.0 [2021]
- 十四、ERNIE-Huawei [2019]
- 十五、MT-DNN [2019]
- 十六、BART [2019]
- 十七、mBART [2020]
- 十八、SpanBERT [2019]
- 十九、ALBERT [2019]
- 二十、UniLM [2019]
- 二十一、MASS [2019]
- 二十二、MacBERT [2019]
- 二十三、Fine-Tuning Language Models from Human Preferences [2019]
- 二十四、Learning to summarize from human feedback [2020]
- 二十五、InstructGPT [2022]
- 二十六、T5 [2020]
- 二十七、mT5 [2020]
- 二十八、ExT5 [2021]
- 二十九、Muppet [2021]
- 三十、Self-Attention with Relative Position Representations [2018]
- 三十一、USE [2018]
- 三十二、Sentence-BERT [2019]
- 三十三、SimCSE [2021]
- 三十四、BERT-Flow [2020]
- 三十五、BERT-Whitening [2021]
- 三十六、Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings [2019]
- 三十七、CERT [2020]
- 三十八、DeCLUTR [2020]
- 三十九、CLEAR [2020]
- 四十、ConSERT [2021]
- 四十一、Sentence-T5 [2021]
- 四十二、ULMFiT [2018]
- 四十三、Scaling Laws for Neural Language Models [2020]
- 四十四、Chinchilla [2022]
- 四十七、GLM-130B [2022]
- 四十八、GPT-NeoX-20B [2022]
- 四十九、Bloom [2022]
- 五十、PaLM [2022] (粗读)
- 五十一、PaLM2 [2023](粗读)
- 五十二、Self-Instruct [2022]
- 句子向量
- 词向量
- 传统CTR 预估模型
- CTR 预估模型
- 一、DSSM [2013]
- 二、FNN [2016]
- 三、PNN [2016]
- 四、DeepCrossing [2016]
- 五、Wide 和 Deep [2016]
- 六、DCN [2017]
- 七、DeepFM [2017]
- 八、NFM [2017]
- 九、AFM [2017]
- 十、xDeepFM [2018]
- 十一、ESMM [2018]
- 十二、DIN [2017]
- 十三、DIEN [2019]
- 十四、DSIN [2019]
- 十五、DICM [2017]
- 十六、DeepMCP [2019]
- 十七、MIMN [2019]
- 十八、DMR [2020]
- 十九、MiNet [2020]
- 二十、DSTN [2019]
- 二十一、BST [2019]
- 二十二、SIM [2020]
- 二十三、ESM2 [2019]
- 二十四、MV-DNN [2015]
- 二十五、CAN [2020]
- 二十六、AutoInt [2018]
- 二十七、Fi-GNN [2019]
- 二十八、FwFM [2018]
- 二十九、FM2 [2021]
- 三十、FiBiNET [2019]
- 三十一、AutoFIS [2020]
- 三十三、AFN [2020]
- 三十四、FGCNN [2019]
- 三十五、AutoCross [2019]
- 三十六、InterHAt [2020]
- 三十七、xDeepInt [2023]
- 三十九、AutoDis [2021]
- 四十、MDE [2020]
- 四十一、NIS [2020]
- 四十二、AutoEmb [2020]
- 四十三、AutoDim [2021]
- 四十四、PEP [2021]
- 四十五、DeepLight [2021]
- 图的表达
- 一、DeepWalk [2014]
- 二、LINE [2015]
- 三、GraRep [2015]
- 四、TADW [2015]
- 五、DNGR [2016]
- 六、Node2Vec [2016]
- 七、WALKLETS [2016]
- 八、SDNE [2016]
- 九、CANE [2017]
- 十、EOE [2017]
- 十一、metapath2vec [2017]
- 十二、GraphGAN [2018]
- 十三、struc2vec [2017]
- 十四、GraphWave [2018]
- 十五、NetMF [2017]
- 十六、NetSMF [2019]
- 十七、PTE [2015]
- 十八、HNE [2015]
- 十九、AANE [2017]
- 二十、LANE [2017]
- 二十一、MVE [2017]
- 二十二、PMNE [2017]
- 二十三、ANRL [2018]
- 二十四、DANE [2018]
- 二十五、HERec [2018]
- 二十六、GATNE [2019]
- 二十七、MNE [2018]
- 二十八、MVN2VEC [2018]
- 二十九、SNE [2018]
- 三十、ProNE [2019]
- Graph Embedding 综述
- 图神经网络
- 一、GNN [2009]
- 二、Spectral Networks 和 Deep Locally Connected Networks [2013]
- 三、Fast Localized Spectral Filtering On Graph [2016]
- 四、GCN [2016]
- 五、神经图指纹 [2015]
- 六、GGS-NN [2016]
- 七、PATCHY-SAN [2016]
- 八、GraphSAGE [2017]
- 九、GAT [2017]
- 十、R-GCN [2017]
- 十一、AGCN [2018]
- 十二、FastGCN [2018]
- 十三、PinSage [2018]
- 十四、GCMC [2017]
- 十五、JK-Net [2018]
- 十六、PPNP [2018]
- 十七、VRGCN [2017]
- 十八、ClusterGCN [2019]
- 十九、LDS-GNN [2019]
- 二十、DIAL-GNN [2019]
- 二十一、HAN [2019]
- 二十二、HetGNN [2019]
- 二十三、HGT [2020]
- 二十四、GPT-GNN [2020]
- 二十五、Geom-GCN [2020]
- 二十六、Graph Network [2018]
- 二十七、GIN [2019]
- 二十八、MPNN [2017]
- 二十九、UniMP [2020]
- 三十、Correct and Smooth [2020]
- 三十一、LGCN [2018]
- 三十二、DGCNN [2018]
- 三十三、AS-GCN
- 三十四、DGI [2018]
- 三十五、DIFFPOLL [2018]
- 三十六、DCNN [2016]
- 三十七、IN [2016]
- 图神经网络 2
- 图神经网络 3
- 推荐算法(传统方法)
- 一、Tapestry [1992]
- 二、GroupLens [1994]
- 三、ItemBased CF [2001]
- 四、Amazon I-2-I CF [2003]
- 五、Slope One Rating-Based CF [2005]
- 六、Bipartite Network Projection [2007]
- 七、Implicit Feedback CF [2008]
- 八、PMF [2008]
- 九、SVD++ [2008]
- 十、MMMF 扩展 [2008]
- 十一、OCCF [2008]
- 十二、BPR [2009]
- 十三、MF for RS [2009]
- 十四、Netflix BellKor Solution [2009]
- 推荐算法(神经网络方法 1)
- 一、MIND [2019](用于召回)
- 二、DNN For YouTube [2016]
- 三、Recommending What Video to Watch Next [2019]
- 四、ESAM [2020]
- 五、Facebook Embedding Based Retrieval [2020](用于检索)
- 六、Airbnb Search Ranking [2018]
- 七、MOBIUS [2019](用于召回)
- 八、TDM [2018](用于检索)
- 九、DR [2020](用于检索)
- 十、JTM [2019](用于检索)
- 十一、Pinterest Recommender System [2017]
- 十二、DLRM [2019]
- 十三、Applying Deep Learning To Airbnb Search [2018]
- 十四、Improving Deep Learning For Airbnb Search [2020]
- 十五、HOP-Rec [2018]
- 十六、NCF [2017]
- 十七、NGCF [2019]
- 十八、LightGCN [2020]
- 十九、Sampling-Bias-Corrected Neural Modeling [2019](检索)
- 二十、EGES [2018](Matching 阶段)
- 二十一、SDM [2019](Matching 阶段)
- 二十二、COLD [2020](Pre-Ranking 模型)
- 二十三、ComiRec [2020](https://www.wenjiangs.com/doc/0b4e1736-ac78)
- 二十四、EdgeRec [2020]
- 二十五、DPSR [2020](检索)
- 二十六、PDN [2021](matching)
- 二十七、时空周期兴趣学习网络ST-PIL [2021]
- 推荐算法之序列推荐
- 一、FPMC [2010]
- 二、GRU4Rec [2015]
- 三、HRM [2015]
- 四、DREAM [2016]
- 五、Improved GRU4Rec [2016]
- 六、NARM [2017]
- 七、HRNN [2017]
- 八、RRN [2017]
- 九、Caser [2018]
- 十、p-RNN [2016]
- 十一、GRU4Rec Top-k Gains [2018]
- 十二、SASRec [2018]
- 十三、RUM [2018]
- 十四、SHAN [2018]
- 十五、Phased LSTM [2016]
- 十六、Time-LSTM [2017]
- 十七、STAMP [2018]
- 十八、Latent Cross [2018]
- 十九、CSRM [2019]
- 二十、SR-GNN [2019]
- 二十一、GC-SAN [2019]
- 二十二、BERT4Rec [2019]
- 二十三、MCPRN [2019]
- 二十四、RepeatNet [2019]
- 二十五、LINet [2019]
- 二十六、NextItNet [2019]
- 二十七、GCE-GNN [2020]
- 二十八、LESSR [2020]
- 二十九、HyperRec [2020]
- 三十、DHCN [2021]
- 三十一、TiSASRec [2020]
- 推荐算法(综述)
- 多任务学习
- 系统架构
- 实践方法论
- 深度强化学习 1
- 自动代码生成
工具
- CRF
- lightgbm
- xgboost
- scikit-learn
- spark
- numpy
- matplotlib
- pandas
- huggingface_transformer
- 一、Tokenizer
- 二、Datasets
- 三、Model
- 四、Trainer
- 五、Evaluator
- 六、Pipeline
- 七、Accelerate
- 八、Autoclass
- 九、应用
- 十、Gradio
Scala
- 环境搭建
- 基础知识
- 函数
- 类
- 样例类和模式匹配
- 测试和注解
- 集合 collection(一)
- 集合 collection(二)
- 集成 Java
- 并发
二、微调 masked language model
首先为 masked language modeling 选择一个合适的预训练模型,如前面用到的 "bert-base-cased"。

```python
from transformers import AutoModelForMaskedLM

model_checkpoint = "bert-base-cased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)

num_parameters = model.num_parameters() / 1_000_000
print(f"BERT_Base number of parameters: {round(num_parameters)}M")
# BERT_Base number of parameters: 108M
```

现在我们来看看 BERT_Base 如何补全一个被掩码的单词:

```python
from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

text = "WuHan City a great [MASK]."
inputs = tokenizer(text, return_tensors="pt")
print(inputs)
# {
#   'input_ids': tensor([[ 101, 8769, 3048, 1389, 1392,  170, 1632,  103,  119,  102]]),
#   'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
#   'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])
# }

token_logits = model(**inputs).logits   # shape: [1, 10, 28996]

##************ 找到 [MASK] 的位置并抽取它的 logits ***********
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
print(mask_token_index)   # tokenizer.mask_token_id = 103
# tensor([7])
mask_token_logits = token_logits[0, mask_token_index, :]   # shape: [1, 28996]

##************ 返回 [MASK] 的 top-k 候选 ***********
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")
# '>>> WuHan City a great city.'
# '>>> WuHan City a great place.'
# '>>> WuHan City a great town.'
# '>>> WuHan City a great village.'
# '>>> WuHan City a great name.'
```
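作为对照,上面手工抽取 top-k 候选的流程也可以直接用 Transformers 的 fill-mask pipeline 完成。下面是一个简单的示意(非原文代码,模型名沿用上面的 "bert-base-cased"):

```python
# 示意:用 fill-mask pipeline 得到与上面等价的 top-5 候选
from transformers import pipeline

mask_filler = pipeline("fill-mask", model="bert-base-cased")
for pred in mask_filler("WuHan City a great [MASK].", top_k=5):
    print(pred["sequence"], round(pred["score"], 4))
```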
2.1 数据集和数据处理
加载数据集:我们在 Large Movie Review Dataset: IMDb 上微调 BERT_Base。该数据集有训练集、测试集、还有 unsupervised 等三个 split。
```python
from datasets import load_dataset

imdb_dataset = load_dataset("imdb")
print(imdb_dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['text', 'label'],
#         num_rows: 25000
#     })
#     test: Dataset({
#         features: ['text', 'label'],
#         num_rows: 25000
#     })
#     unsupervised: Dataset({
#         features: ['text', 'label'],
#         num_rows: 50000
#     })
# })
```

数据处理:对于自回归语言建模以及掩码语言建模,一个常见的预处理步骤是拼接所有样本,然后将整个语料库拆分为相同大小的 block。我们还需要保留 word id 序列,以便后续用于全词掩码(whole word masking)。
```python
result = tokenizer("Welcome to WuHan City", is_split_into_words=False)   # 执行 tokenization
print(result)
# {
#   'input_ids': [101, 12050, 1106, 8769, 3048, 1389, 1392, 102],
#   'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
#   'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]
# }
print(result.word_ids())   # 每个 token 对应的 word id
# [None, 0, 1, 2, 2, 2, 3, None]
```

此外,我们删除 text 字段和 label 字段,因为不再需要。我们构建一个函数来执行这些:
```python
def tokenize_function(examples):   # examples 是一个 batch 的样本
    result = tokenizer(examples["text"])   # result 包含 batch 结果
    if tokenizer.is_fast:
        # result.word_ids(i) 返回第 i 个样本的 word id 序列
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

# 使用 batched=True
tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)
print(tokenized_datasets)   # word_ids 列是我们人工添加的
# DatasetDict({
#     train: Dataset({
#         features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'],
#         num_rows: 25000
#     })
#     test: Dataset({
#         features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'],
#         num_rows: 25000
#     })
#     unsupervised: Dataset({
#         features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids'],
#         num_rows: 50000
#     })
# })
```

现在我们已经完成了 tokenization。下一步是将它们拼接在一起然后分块。块的大小怎么选择?这取决于 GPU 的显存大小。此外,还可以参考模型的最大上下文长度,这可以通过 tokenizer.model_max_length 属性来判断:
```python
print(tokenizer.model_max_length)   # 512
```

然后我们拼接文本并拆分为大小为 block_size 的块。
```python
def group_texts(examples, chunk_size=128):
    # examples.keys() 为 ('input_ids', 'token_type_ids', 'attention_mask', 'word_ids')
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}   # 拼接样本
    total_length = len(concatenated_examples[list(examples.keys())[0]])          # 计算总的 token 长度
    # 移除最后一个小于 chunk_size 的块(也可以填充最后一个块到 chunk_size 长度)
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]   # 执行分块
        for k, t in concatenated_examples.items()
    }
    # label 就是 input token 序列,因为 MLM 的 label 就是被掩码的 token
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(group_texts, batched=True)
print(lm_datasets)
# DatasetDict({
#     train: Dataset({
#         features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
#         num_rows: 63037
#     })
#     test: Dataset({
#         features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
#         num_rows: 61623
#     })
#     unsupervised: Dataset({
#         features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
#         num_rows: 126497
#     })
# })
```

以训练集为例可以看到,样本数量比原始的 25k 个样本更多,因为现在的样本是由 contiguous token 组成的块,而不是原始的情感分类样本。

现在还缺少关键的一步:在输入的随机位置插入 [MASK] token。这需要在训练过程中动态地插入,而不是静态地提前准备好。
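在进入掩码步骤之前,可以先随手检查一个分块后的样本(一个简单示意,非原文代码,复用上面的 lm_datasets 和 tokenizer),确认每个块确实是 128 个连续的 token:

```python
# 示意:检查分块结果
sample = lm_datasets["train"][1]
print(len(sample["input_ids"]))                      # 128
print(tokenizer.decode(sample["input_ids"])[:200])   # 解码后可以看到评论文本被"接着上一条"切开
```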
2.2 使用 Trainer API 微调模型
如前所述,我们需要在训练过程中动态地在输入的随机位置插入 [MASK] token。这需要一个特殊的 data collator,从而可以在训练过程中动态地随机掩码输入文本中的一些 token,即 DataCollatorForLanguageModeling。我们需要向它传入 mlm_probability 参数从而指定 masked token 的占比。我们选择 15%,因为这是论文中常用的配置:
```python
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

samples = [lm_datasets["train"][i] for i in range(2)]
for sample in samples:
    _ = sample.pop("word_ids")   # 移除 word_ids,否则 data_collator(samples) 抛出异常

for chunk in data_collator(samples)["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")
# >>> [CLS] I rented I AM [MASK] deliberate [MASK]US - YEL [MASK]OW from my video store because of all thedating that surrounded it when it [MASK] first released in 1967. I also heard [MASK] [MASK] first it was seized by U. S. [MASK] if it ever tried to enter this [MASK], [MASK] being a fan of films [MASK] " controversial " I really had to see this for myself. < br / [MASK] < br / > The plot [MASK] centered around a young Swedish drama student named Lena who neighbouring to learn everything she can about [MASK]. In particular [MASK] wants [MASK] focus her attention [MASK] to making some sort of documentary on [MASK] the average Swed
# >>> ##e thought about [MASK] political [MASK] such as [MASK] Vietnam War and [MASK] issues in [MASK] [MASK] States. In between asking politicians [MASK] ordinary [MASK] [MASK]mony of Stockholm about their opinions on politics, she [MASK] [MASK] with [MASK] drama [MASK] [MASK] classmates, [MASK] married men. [MASK] br / Quaker < br / > What kills me about I AM CURIOUS - YELLOW is that 40 years ago, this was considered pornographic. Really [MASK] [MASK] sex and nudi [MASK] scenes are few and far between [MASK] even then it's not shot like some cheaply made [MASK]orno [MASK] While my countrymen mind find it shocking, in [MASK]
```

随机掩码的一个副作用是:当使用 Trainer 时,我们的评估指标可能是随机的(每一次评估的结果可能各不相同),因为测试集使用的也是相同的 DataCollatorForLanguageModeling。然而,我们可以利用 Accelerate 来自定义训练过程(而不是 Trainer 封装好的训练过程),从而在训练过程中冻结随机性。
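下面用一段简单的演示代码(非原文代码,复用上面的 data_collator 和 samples)直观展示这种随机性:对同一批样本做两次 collate,[MASK] 的位置通常不同:

```python
# 演示掩码的随机性:同一批样本两次 collate,掩码位置一般不同
batch1 = data_collator(samples)
batch2 = data_collator(samples)
print((batch1["input_ids"] != batch2["input_ids"]).any())   # 通常输出 tensor(True)
```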
全词掩码(whole word masking: WWM):全词掩码是掩码整个单词,而不仅仅是掩码单词内的单个 token。如果我们想使用全词掩码,我们需要自己构建一个 data collator。此时,我们需要用到之前计算的 word_ids,它给出了每个 token 对应的 word id。注意,除了与 [MASK] 对应的 label 以外,所有其他的 label 都是 -100。
```python
import collections
import numpy as np
from transformers import default_data_collator

wwm_probability = 0.2

def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        mapping = collections.defaultdict(list)   # 存放 word_id 到它包含的 token id list 的映射
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # 随机掩码 word
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))   # 注意,单个元素的元组 (xxx, )
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)   # 默认全为 -100
        for word_id in np.where(mask)[0]:   # np.where(mask) 返回一个元组
            word_id = word_id.item()
            for idx in mapping[word_id]:    # 被掩码的单词所对应的 token_id
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)

samples = [lm_datasets["train"][i] for i in range(2)]
batch = whole_word_masking_data_collator(samples)

for chunk in batch["input_ids"]:
    print(f"\n'>>> {tokenizer.decode(chunk)}'")
# >>> [CLS] I rented I AM [MASK] [MASK] [MASK] [MASK] - YELLOW from my [MASK] store because of all the controversy that [MASK] it when it was first released [MASK] [MASK]. [MASK] also heard that at first it was [MASK] [MASK] U [MASK] S [MASK] [MASK] if it [MASK] tried to enter this country, therefore [MASK] a fan [MASK] [MASK] [MASK] " [MASK] " I really had [MASK] see this for myself [MASK] [MASK] br / [MASK] < br / > The plot [MASK] [MASK] around a young [MASK] drama student [MASK] Lena who wants to learn everything she [MASK] [MASK] life. In particular she wants to [MASK] her attentions to making some sort of documentary [MASK] what [MASK] average Swed
# >>> ##e thought about certain political issues such as the Vietnam War and race issues in [MASK] United States [MASK] In between asking politicians and ordinary denizens of [MASK] about their opinions on politics, [MASK] has [MASK] with [MASK] [MASK] teacher [MASK] [MASK], and married men. [MASK] [MASK] / [MASK] < br [MASK] > What kills me about I [MASK] CURIOUS - [MASK] [MASK] [MASK] [MASK] is that 40 years ago, this was considered pornographic. Really, the sex [MASK] nudity scenes are few and [MASK] [MASK] [MASK] even then it's [MASK] [MASK] like some cheaply made [MASK] [MASK] [MASK]. [MASK] my countrymen mind find [MASK] [MASK], in [MASK]
```

数据集缩小:为了演示的方便,我们将训练集缩小为数千个样本。
```python
train_size = 10_000
test_size = int(0.1 * train_size)

downsampled_dataset = lm_datasets["train"].train_test_split(
    train_size=train_size, test_size=test_size, seed=42
)
print(downsampled_dataset)
# DatasetDict({
#     train: Dataset({
#         features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
#         num_rows: 10000
#     })
#     test: Dataset({
#         features: ['input_ids', 'token_type_ids', 'attention_mask', 'word_ids', 'labels'],
#         num_rows: 1000
#     })
# })
```

配置 Trainer:接下来我们可以登录 Hugging Face Hub(可选,方式为在命令行中执行命令 huggingface-cli login)。
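如果是在 notebook 环境中,也可以不走命令行,而是使用 huggingface_hub 提供的 notebook_login 登录(可选步骤的一个简单示意):

```python
# 可选:在 notebook 中登录 Hugging Face Hub,与命令行 huggingface-cli login 等价
from huggingface_hub import notebook_login

notebook_login()
```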
```python
from transformers import TrainingArguments, Trainer

batch_size = 64
logging_steps = len(downsampled_dataset["train"]) // batch_size   # 每个 epoch 打印 training loss
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-imdb",
    overwrite_output_dir=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    num_train_epochs=3,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    push_to_hub=False,              # 这里暂时先不 push 到 hub
    fp16=True,                      # 混合精度训练从而加速训练过程
    logging_steps=logging_steps,    # 设置 logging_steps
    remove_unused_columns=False,    # 用于 WWM data_collator
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=downsampled_dataset["train"],
    eval_dataset=downsampled_dataset["test"],
    # data_collator=data_collator,
    data_collator=whole_word_masking_data_collator,   # WWM data_collator
)
```

默认情况下,Trainer 将删除模型 forward() 方法签名中不存在的列。这意味着,如果你使用 WWM data_collator,你还需要设置 remove_unused_columns=False,以确保我们不会在训练期间丢失 word_ids 列。
语言模型的困惑度(perplexity):一个好的语言模型会为语法正确的句子分配高概率,为无意义的句子分配低概率,我们通过困惑度来衡量这种能力。困惑度有多种数学定义,这里我们采用交叉熵损失的指数。因此,我们可以通过 Trainer.evaluate() 函数计算测试集上的交叉熵损失,然后取结果的指数来计算预训练模型的困惑度:
```python
import math

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
# >>> Perplexity: 39.75

trainer.train()

eval_results = trainer.evaluate()
print(f">>> Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
# >>> Perplexity: 22.11

# trainer.push_to_hub()   # 若前面设置 push_to_hub=True,训练后可以推送到 Hub
tokenizer.save_pretrained(training_args.output_dir)   # 保存 tokenizer,便于后续用 pipeline 加载
```

较低的困惑度分数意味着更好的语言模型。可以看到:微调后模型的困惑度降低了很多,这表明模型已经学到了一些关于电影评论领域的知识。
使用模型:现在可以通过 Transformers 的 pipeline 来调用微调后的模型:

```python
from transformers import pipeline

mask_filler = pipeline(   # 注意,tokenizer 也需要保存在这个目录下
    "fill-mask", model="./bert-base-cased-finetuned-imdb/checkpoint-471"
)

preds = mask_filler("WuHan City a great [MASK].")
for pred in preds:
    print(f">>> {pred['sequence']}")
# >>> WuHan City a great city.
# >>> WuHan City a great place.
# >>> WuHan City a great town.
# >>> WuHan City a great one.
# >>> WuHan City a great name.
```
2.3 自定义训练过程
DataCollatorForLanguageModeling 对每次评估过程采用随机掩码,因此每次训练运行时,我们都会看到困惑度分数的一些波动。消除这种随机性来源的一种方法是:在整个测试集上只应用一次掩码,然后使用 Transformers 中默认的 data collator。

注意,自定义训练过程需要人工创建 dataloader,而 Trainer API 只需要传入 dataset 而无需创建 dataloader。整体代码如下所示:
```python
from transformers import AutoModelForMaskedLM
from transformers import AutoTokenizer
from tqdm.auto import tqdm
import torch
import math
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling
import collections
import numpy as np
from transformers import default_data_collator
from torch.utils.data import DataLoader
from torch.optim import AdamW
from accelerate import Accelerator
from transformers import get_scheduler

##********** 加载 pre-trained model, tokenizer 和数据集 ********
model_checkpoint = "bert-base-cased"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
imdb_dataset = load_dataset("imdb")

##********* tokenization ****************
def tokenize_function(examples):   # examples 是一个 batch 的样本
    result = tokenizer(examples["text"])   # result 包含 batch 结果
    if tokenizer.is_fast:
        # result.word_ids(i) 返回第 i 个样本的 word id 序列
        result["word_ids"] = [result.word_ids(i) for i in range(len(result["input_ids"]))]
    return result

tokenized_datasets = imdb_dataset.map(
    tokenize_function, batched=True, remove_columns=["text", "label"]
)

##********* 分块 ****************
def group_texts(examples, chunk_size=128):
    # examples.keys() 为 ('input_ids', 'token_type_ids', 'attention_mask', 'word_ids')
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}   # 拼接样本
    total_length = len(concatenated_examples[list(examples.keys())[0]])          # 计算总的 token 长度
    # 移除最后一个小于 chunk_size 的块(也可以填充最后一个块到 chunk_size 长度)
    total_length = (total_length // chunk_size) * chunk_size
    result = {
        k: [t[i : i + chunk_size] for i in range(0, total_length, chunk_size)]   # 执行分块
        for k, t in concatenated_examples.items()
    }
    # label 就是 input token 序列,因为 MLM 的 label 就是被掩码的 token
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(group_texts, batched=True)

train_size = 10_000
test_size = int(0.1 * train_size)
downsampled_dataset = lm_datasets["train"].train_test_split(   # demo: 减小数据规模
    train_size=train_size, test_size=test_size, seed=42
)

##********** 创建 WWM data collator ***********
# data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
wwm_probability = 0.2

def whole_word_masking_data_collator(features):
    for feature in features:
        word_ids = feature.pop("word_ids")

        mapping = collections.defaultdict(list)   # 存放 word_id 到它包含的 token id list 的映射
        current_word_index = -1
        current_word = None
        for idx, word_id in enumerate(word_ids):
            if word_id is not None:
                if word_id != current_word:
                    current_word = word_id
                    current_word_index += 1
                mapping[current_word_index].append(idx)

        # 随机掩码 word
        mask = np.random.binomial(1, wwm_probability, (len(mapping),))   # 注意,单个元素的元组 (xxx, )
        input_ids = feature["input_ids"]
        labels = feature["labels"]
        new_labels = [-100] * len(labels)   # 默认全为 -100
        for word_id in np.where(mask)[0]:   # np.where(mask) 返回一个元组
            word_id = word_id.item()
            for idx in mapping[word_id]:    # 被掩码的单词所对应的 token_id
                new_labels[idx] = labels[idx]
                input_ids[idx] = tokenizer.mask_token_id
        feature["labels"] = new_labels

    return default_data_collator(features)

##************ 对测试集进行静态掩码 ***************
def insert_random_mask(batch):
    features = [dict(zip(batch, t)) for t in zip(*batch.values())]
    masked_inputs = whole_word_masking_data_collator(features)
    # 对于数据集中的每一列,创建一个对应的 masked 列
    return {"masked_" + k: v.numpy() for k, v in masked_inputs.items()}

# downsampled_dataset = downsampled_dataset.remove_columns(["word_ids"])  # 如果是 whole_word_masking_data_collator 则注释掉这一行
eval_dataset = downsampled_dataset["test"].map(
    insert_random_mask,
    batched=True,
    remove_columns=downsampled_dataset["test"].column_names,
)
eval_dataset = eval_dataset.rename_columns(
    {
        "masked_input_ids": "input_ids",
        "masked_attention_mask": "attention_mask",
        "masked_labels": "labels",
    }
).remove_columns(["masked_token_type_ids"])   # 移除一些不需要的列

##************ 创建 data loader ***************
batch_size = 64
train_dataloader = DataLoader(
    downsampled_dataset["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=whole_word_masking_data_collator,
)
eval_dataloader = DataLoader(
    eval_dataset, batch_size=batch_size, collate_fn=default_data_collator
)

##************ 创建训练组件 ***************
optimizer = AdamW(model.parameters(), lr=5e-5)

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)

##********** 在 Hugging Face Hub 上创建一个模型库(可以忽略)**********
# from huggingface_hub import get_full_repo_name
model_name = "%s-finetuned-imdb-accelerate" % model_checkpoint
# repo_name = get_full_repo_name(model_name)
# from huggingface_hub import Repository
output_dir = model_name
# repo = Repository(output_dir, clone_from=repo_name)

##************ 训练和评估 ****************
progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)

        loss = outputs.loss
        losses.append(accelerator.gather(loss.repeat(batch_size)))   # 跨进程收集 loss(扩展到 batch 内每个样本)

    losses = torch.cat(losses)
    losses = losses[: len(eval_dataset)]   # 截断到验证集大小(分布式评估时最后一个 batch 可能含重复样本)

    try:
        perplexity = math.exp(torch.mean(losses))
    except OverflowError:
        perplexity = float("inf")

    print(f">>> Epoch {epoch}: Perplexity: {perplexity}")
    # >>> Epoch 0: Perplexity: 22.54525292335159
    # >>> Epoch 1: Perplexity: 21.186613045279536
    # >>> Epoch 2: Perplexity: 20.757056615284373

    ##*********** 保存、上传微调好的模型 ****************
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)   # 用 accelerator.save
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        # repo.push_to_hub(
        #     commit_message=f"Training in progress epoch {epoch}", blocking=False
        # )
```