3. The Hugging Face Tokenizer Library
For the three algorithms BPE, WordPiece, and Unigram, we use the same corpus:

```python
corpus = [
    # The first sentences from the abstract of "<Attention Is All You Need>"
    "The dominant sequence transduction models are based on complex recurrent orconvolutional neural networks that include an encoder and a decoder.",
    "The bestperforming models also connect the encoder and decoder through an attentionmechanism.",
    "We propose a new simple network architecture, the Transformer,based solely on attention mechanisms, dispensing with recurrence and convolutionsentirely."
]
```
2.1 BPE
Training algorithm:
```python
from collections import defaultdict
from tokenizers import decoders, models, normalizers, \
    pre_tokenizers, processors, trainers, Tokenizer

corpus = [
    # The first sentences from the abstract of "<Attention Is All You Need>"
    "The dominant sequence transduction models are based on complex recurrent orconvolutional neural networks that include an encoder and a decoder.",
    "The bestperforming models also connect the encoder and decoder through an attentionmechanism.",
    "We propose a new simple network architecture, the Transformer,based solely on attention mechanisms, dispensing with recurrence and convolutionsentirely."
]

#################### Step1: word freq ################
word_freqs = defaultdict(int)
pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

for text in corpus:
    words_with_offsets = pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

print(word_freqs)
# defaultdict(<class 'int'>, {'The': 2, 'Ġdominant': 1, 'Ġsequence': 1, 'Ġtransduction': 1, ...})

#################### Step2: alphabet ################
alphabet = []  # the initial alphabet
for word in word_freqs.keys():
    for letter in word:
        if letter not in alphabet:
            alphabet.append(letter)
alphabet.sort()

print(alphabet)  # 'Ġ' represents the space character
# [',', '.', 'T', 'W', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'Ġ']

vocab = ["<|endoftext|>"] + alphabet.copy()  # add the special token for GPT-2

#################### Step3: split word to char ################
splits = {word: [c for c in word] for word in word_freqs.keys()}
print(splits)  # each character is a subword
# {'The': ['T', 'h', 'e'], 'Ġdominant': ['Ġ', 'd', 'o', 'm', 'i', 'n', 'a', 'n', 't'],...}

#################### Step4: find most freq and merge ################
def compute_pair_freqs(splits):
    '''
    Count how often each pair of adjacent subwords occurs as a whole.
    :param splits: the split of every word so far
    '''
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            pair_freqs[pair] += freq
    return pair_freqs

def find_most_freq(pair_freqs):
    '''
    Find the pair with the highest frequency.
    '''
    best_pair = ""
    max_freq = None
    for pair, freq in pair_freqs.items():
        if max_freq is None or max_freq < freq:
            best_pair = pair
            max_freq = freq
    print("\t Find most freq: pair[%s], freq[%s]" % (best_pair, max_freq))
    return best_pair

def merge_pair(a, b, splits):
    '''
    Merge every adjacent subword pair "a b" in the current splits into "ab".
    '''
    combine_ab = "%s%s" % (a, b)
    for word in word_freqs:
        split = splits[word]  # the current split of word
        if len(split) == 1:   # only one subword: the subword is the word itself
            continue
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:  # a and b are adjacent and can be merged
                split = split[:i] + [combine_ab, ] + split[i + 2:]
            else:
                i += 1
        splits[word] = split
    return splits

merges = {}
vocab_size = 50

while len(vocab) < vocab_size:
    print("Current vocab size:%s" % len(vocab))
    pair_freqs = compute_pair_freqs(splits)
    print("\t Top3 Pair freq:%s" % sorted(pair_freqs.items(), key=lambda x: -x[1])[:3])  # sorted by frequency, descending
    current_pair = find_most_freq(pair_freqs)
    new_subword = "%s%s" % (current_pair[0], current_pair[1])
    splits = merge_pair(current_pair[0], current_pair[1], splits)
    print("\t Merge '%s %s' to '%s'" % (current_pair[0], current_pair[1], new_subword))
    merges[current_pair] = new_subword
    vocab.append(new_subword)

# Current vocab size:30
#      Top3 Pair freq:[(('Ġ', 'm'), 3), (('l', 's'), 3), (('Ġ', 'c'), 3)]
#      Find most freq: pair[('Ġ', 'm')], freq[3]
#      Merge 'Ġ m' to 'Ġm'
# Current vocab size:31
#      Top3 Pair freq:[(('l', 's'), 3), (('Ġ', 'c'), 3), (('l', 'e'), 3)]
#      Find most freq: pair[('l', 's')], freq[3]
#      Merge 'l s' to 'ls'
# ...

print(merges)
# the 20 learned merge rules
# {('Ġ', 'm'): 'Ġm', ('l', 's'): 'ls', ('Ġ', 'c'): 'Ġc', ('l', 'e'): 'le', ...}

print(vocab)
# the vocabulary consists of the special token, the initial alphabet, and the merge results
# ['<|endoftext|>', ',', '.', 'T', 'W', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'Ġ', 'Ġm', 'ls', 'Ġc', 'le', 'lu', 'Ġand', 'is', 'The', 'Ġd', 'om', 'ence', 'ran', 'rans', 'Ġmode', 'Ġmodels', 'Ġar', 'Ġb', 'ase', 'ased', 'Ġon']
```

To tokenize new text, we pre-tokenize it, split each word into individual characters, and then apply all of the learned merge rules.
```python
def tokenize(text, merges):
    '''
    Tokenize a text; merges holds all of the learned merge rules.
    '''
    ################## step1: pre_tokenize ##################
    pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    pre_tokenize_result = pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]

    ################## step2: split ##################
    splits = [[ch for ch in word] for word in pre_tokenized_text]

    ################## step3: tokenize ##################
    # apply shorter merges first, then longer ones (approximating the order in which they were learned)
    for pair, merge in sorted(merges.items(), key=lambda x: len(x[1])):
        for idx, split in enumerate(splits):
            i = 0
            ########### process each split ########
            while i < len(split) - 1:
                if split[i] == pair[0] and split[i + 1] == pair[1]:
                    split = split[:i] + [merge] + split[i + 2:]
                else:
                    i += 1
            splits[idx] = split
    return sum(splits, [])

print(tokenize("This's me .", merges))
# ['T', 'h', 'is', "'", 's', 'Ġm', 'e', 'Ġ', 'Ġ', '.']
```
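The loop above is a from-scratch illustration. For comparison, the `tokenizers` library can also train a byte-level BPE model directly. The following is a minimal sketch, assuming `tokenizers` is installed and reusing the `corpus` list defined above; the variable names and the `vocab_size` of 50 are chosen here just to mirror the toy example, and the learned merges may differ in detail from the hand-rolled version:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE tokenizer with the same pre-tokenization as the example above.
bpe_tokenizer = Tokenizer(models.BPE())
bpe_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

# Train on the toy corpus with a tiny vocabulary, reserving the GPT-2 special token.
trainer = trainers.BpeTrainer(vocab_size=50, special_tokens=["<|endoftext|>"])
bpe_tokenizer.train_from_iterator(corpus, trainer=trainer)

print(bpe_tokenizer.encode("This's me .").tokens)
```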
2.2 WordPiece
Training algorithm:
```python
from collections import defaultdict
from tokenizers import pre_tokenizers

corpus = [
    # The first sentences from the abstract of "<Attention Is All You Need>"
    "The dominant sequence transduction models are based on complex recurrent orconvolutional neural networks that include an encoder and a decoder.",
    "The bestperforming models also connect the encoder and decoder through an attentionmechanism.",
    "We propose a new simple network architecture, the Transformer,based solely on attention mechanisms, dispensing with recurrence and convolutionsentirely."
]

#################### Step1: word freq ################
word_freqs = defaultdict(int)
pre_tokenizer = pre_tokenizers.BertPreTokenizer()

for text in corpus:
    words_with_offsets = pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

print(word_freqs)
# defaultdict(<class 'int'>, {'The': 2, 'dominant': 1, 'sequence': 1, ...})

#################### Step2: alphabet ################
alphabet = []  # the initial alphabet
for word in word_freqs.keys():
    if word[0] not in alphabet:            # the first letter of the word
        alphabet.append(word[0])
    for letter in word[1:]:                # not the first letter of the word
        if f"##{letter}" not in alphabet:  # f"##{letter}" is f-string syntax: {letter} is replaced by the value of letter
            alphabet.append(f"##{letter}")
alphabet.sort()

print(alphabet)
# ['##a', '##c', '##d', '##e', '##f', '##g', '##h', '##i', '##k', '##l', '##m', '##n', '##o', '##p', '##q', '##r', '##s', '##t', '##u', '##v', '##w', '##x', '##y', ',', '.', 'T', 'W', 'a', 'b', 'c', 'd', 'e', 'i', 'm', 'n', 'o', 'p', 'r', 's', 't', 'w']

vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] + alphabet.copy()  # add the special tokens

#################### Step3: split word to char ################
splits = {
    word: [c if i == 0 else f"##{c}" for i, c in enumerate(word)]
    for word in word_freqs.keys()
}
print(splits)  # each character is a subword
# {'The': ['T', '##h', '##e'], 'dominant': ['d', '##o', '##m', '##i', '##n', '##a', '##n', '##t'],...}

#################### Step4: find highest score and merge ################
def compute_pair_scores(splits):
    '''
    Compute the score of merging each pair of adjacent subwords.
    :param splits: the split of every word so far
    '''
    letter_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:                 # only one subword (the word itself)
            letter_freqs[split[0]] += freq
            continue
        for i in range(len(split) - 1):     # multiple subwords
            pair = (split[i], split[i + 1])
            letter_freqs[split[i]] += freq
            pair_freqs[pair] += freq
        letter_freqs[split[-1]] += freq     # the last subword starts no pair, but it still needs to be counted
    scores = {
        pair: freq / (letter_freqs[pair[0]] * letter_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return scores

def find_max_score(scores):
    '''
    Find the pair with the highest score.
    '''
    best_pair = ""
    max_score = None
    for pair, score in scores.items():
        if max_score is None or max_score < score:
            best_pair = pair
            max_score = score
    print("\t Find max score: pair[%s], freq[%s]" % (best_pair, max_score))
    return best_pair

def merge_pair(a, b, splits):
    '''
    Merge every adjacent subword pair "a b" in the current splits into "ab".
    '''
    combine_ab = "%s%s" % (a, b[2:] if b.startswith("##") else b)
    for word in word_freqs:
        split = splits[word]  # the current split of word
        if len(split) == 1:   # only one subword: the subword is the word itself
            continue
        i = 0
        while i < len(split) - 1:
            if split[i] == a and split[i + 1] == b:  # a and b are adjacent and can be merged
                split = split[:i] + [combine_ab, ] + split[i + 2:]
            else:
                i += 1
        splits[word] = split
    return splits

vocab_size = 50

while len(vocab) < vocab_size:
    print("Current vocab size:%s" % len(vocab))
    scores = compute_pair_scores(splits)
    print("\t Top3 Pair scores:%s" % sorted(scores.items(), key=lambda x: -x[1])[:3])  # sorted by score, descending
    current_pair = find_max_score(scores)
    new_subword = "%s%s" % (current_pair[0], current_pair[1][2:] if current_pair[1].startswith("##") else current_pair[1])
    splits = merge_pair(current_pair[0], current_pair[1], splits)
    print("\t Merge '%s %s' to '%s'" % (current_pair[0], current_pair[1], new_subword))
    vocab.append(new_subword)

# Current vocab size:46
#      Top3 Pair scores:[(('##q', '##u'), 0.1), (('##l', '##y'), 0.076923), (('t', '##h'), 0.072727)]
#      Find max score: pair[('##q', '##u')], freq[0.1]
#      Merge '##q ##u' to '##qu'
# Current vocab size:47
#      Top3 Pair scores:[(('##l', '##y'), 0.076923), (('t', '##h'), 0.072727), (('b', '##a'), 0.066667)]
#      Find max score: pair[('##l', '##y')], freq[0.076923]
#      Merge '##l ##y' to '##ly'
# ...

print(vocab)
# the vocabulary consists of the special tokens, the initial alphabet, and the merge results
# ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]', '##a', '##c', '##d', '##e', '##f', '##g', '##h', '##i', '##k', '##l', '##m', '##n', '##o', '##p', '##q', '##r', '##s', '##t', '##u', '##v', '##w', '##x', '##y', ',', '.', 'T', 'W', 'a', 'b', 'c', 'd', 'e', 'i', 'm', 'n', 'o', 'p', 'r', 's', 't', 'w', '##qu', '##ly', 'th', 'Th']
```

To tokenize new text, we pre-tokenize it, then for each word we find the longest subword in the vocabulary that matches from the beginning of the word, split it off, and repeat this splitting on the remainder.
```python
def encode_word(word, vocab):
    '''
    Split a word into subwords with WordPiece.
    '''
    tokens = []
    while len(word) > 0:
        i = len(word)
        while i > 0 and word[:i] not in vocab:  # longest-prefix match
            i -= 1
        if i == 0:
            return ["[UNK]"]
        tokens.append(word[:i])  # the longest matched subword
        word = word[i:]          # the remainder
        if len(word) > 0:
            word = f"##{word}"
    return tokens

def tokenize(text, vocab):
    '''
    Tokenize a text; vocab is the vocabulary.
    '''
    pre_tokenizer = pre_tokenizers.BertPreTokenizer()
    pre_tokenize_result = pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    encoded_words = [encode_word(word, vocab) for word in pre_tokenized_text]
    return sum(encoded_words, [])  # flatten the list of lists

print(tokenize("This's me .", vocab))
# ['Th', '##i', '##s', '[UNK]', 's', 'm', '##e', '.']
```
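Again for comparison, here is a sketch of training a WordPiece model with the library's own trainer, assuming `tokenizers` is installed and reusing the `corpus` defined above. The BERT-style special tokens and the small `vocab_size` mirror the toy example; the library's internal scoring may not reproduce the exact vocabulary of the loop above:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# WordPiece model with BERT-style pre-tokenization, as in the example above.
wp_tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
wp_tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()

trainer = trainers.WordPieceTrainer(
    vocab_size=50,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
wp_tokenizer.train_from_iterator(corpus, trainer=trainer)

print(wp_tokenizer.encode("This's me .").tokens)
```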
2.3 Unigram
Training algorithm:
```python
from collections import defaultdict
from tokenizers import pre_tokenizers
from math import log
import copy

corpus = [
    # The first sentences from the abstract of "<Attention Is All You Need>"
    "The dominant sequence transduction models are based on complex recurrent orconvolutional neural networks that include an encoder and a decoder.",
    "The bestperforming models also connect the encoder and decoder through an attentionmechanism.",
    "We propose a new simple network architecture, the Transformer,based solely on attention mechanisms, dispensing with recurrence and convolutionsentirely."
]

#################### Step1: word freq ################
word_freqs = defaultdict(int)
pre_tokenizer = pre_tokenizers.Metaspace()

for text in corpus:
    words_with_offsets = pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

print(word_freqs)
# defaultdict(<class 'int'>, {'▁The': 2, '▁dominant': 1, '▁sequence': 1, ...})

#################### Step2: initial vocab ################
char_freqs = defaultdict(int)      # frequency of each character
subwords_freqs = defaultdict(int)  # frequency of each substring
for word, freq in word_freqs.items():
    for i in range(len(word)):
        char_freqs[word[i]] += freq
        # Loop through the subwords of length at least 2
        for j in range(i + 2, len(word) + 1):
            subwords_freqs[word[i:j]] += freq

sorted_subwords = sorted(subwords_freqs.items(), key=lambda x: x[1], reverse=True)
init_vocab_size = 300  # a relatively large initial vocabulary
token_freqs = list(char_freqs.items()) + sorted_subwords[: init_vocab_size - len(char_freqs)]
token_freqs = {token: freq for token, freq in token_freqs}

print(sorted_subwords[:5])
# [('▁a', 12), ('an', 10), ('on', 10), ('en', 9), ('de', 9)]

#################### Step3: model ################
total_sum = sum([freq for token, freq in token_freqs.items()])
# model stores the negative log-likelihood of each candidate token
model = {token: -log(freq * 1.0 / total_sum) for token, freq in token_freqs.items()}

#################### Step4: encoding function and loss function ################
def encode_word(word, model):
    '''
    Viterbi decoding via dynamic programming: segment the word according to the loss of each subword.
    '''
    best_segmentations = [{"start": 0, "score": 1}] + [
        {"start": None, "score": None} for _ in range(len(word))
    ]
    # core data structure storing the state of every position: element i describes the best
    # segmentation of the prefix word[:i] as (most recent split point, loss of the best segmentation)
    for start_idx in range(len(word)):
        # This should be properly filled by the previous steps of the loop
        best_score_at_start = best_segmentations[start_idx]["score"]  # best score of the prefix
        ######### search for the next split point #############
        for end_idx in range(start_idx + 1, len(word) + 1):
            token = word[start_idx:end_idx]
            if token in model and best_score_at_start is not None:
                score = model[token] + best_score_at_start
                if (
                    best_segmentations[end_idx]["score"] is None
                    or best_segmentations[end_idx]["score"] > score  # smaller loss
                ):
                    best_segmentations[end_idx] = {"start": start_idx, "score": score}

    segmentation = best_segmentations[-1]  # the last position holds the final segmentation
    if segmentation["score"] is None:
        # We did not find a tokenization of the word -> unknown
        return ["<unk>"], None

    score = segmentation["score"]
    start = segmentation["start"]  # the previous split point
    end = len(word)
    tokens = []
    while start != 0:
        tokens.insert(0, word[start:end])
        next_start = best_segmentations[start]["start"]
        end = start
        start = next_start
    tokens.insert(0, word[start:end])
    return tokens, score

def compute_loss(model):
    '''
    Compute the overall loss of the corpus under the current model.
    '''
    loss = 0
    for word, freq in word_freqs.items():
        _, word_loss = encode_word(word, model)
        loss += freq * word_loss
    return loss

def compute_scores(model):
    '''
    Score each token by how much the loss changes when that token is removed.
    '''
    scores = {}
    model_loss = compute_loss(model)
    for token, score in model.items():
        if len(token) == 1:  # always keep single characters
            continue
        model_without_token = copy.deepcopy(model)
        _ = model_without_token.pop(token)
        scores[token] = compute_loss(model_without_token) - model_loss
    return scores

#################### Step5: shrink the vocabulary ################
percent_to_remove = 0.1  # remove 10% of the tokens per iteration
max_vocab_size = 100     # maximum vocabulary size

while len(model) > max_vocab_size:
    scores = compute_scores(model)
    sorted_scores = sorted(scores.items(), key=lambda x: x[1])
    print("Top3 scores:%s" % sorted_scores[-3:])
    for i in range(int(len(model) * percent_to_remove)):  # remove the 10% with the smallest scores
        _ = token_freqs.pop(sorted_scores[i][0])
    ### rebuild the model ###
    total_sum = sum([freq for token, freq in token_freqs.items()])
    model = {token: -log(freq * 1.0 / total_sum) for token, freq in token_freqs.items()}

# Top3 scores:[('ing', 8.45913446432769), ('form', 9.041467278547316), ('▁and', 9.270398846926355)]
# Top3 scores:[('form', 8.756385177048287), ('▁and', 8.84277569467804), ('tion', 9.158034534900253)]
# Top3 scores:[('rans', 11.55887624144998), ('▁The', 13.833700317065222), ('▁models', 21.35200333126363)]
# ...
```

To tokenize new text, we pre-tokenize it and then run Viterbi decoding on each word.
```python
def tokenize(text, model):
    '''
    Tokenize a text.
    '''
    words_with_offsets = pre_tokenizers.Metaspace().pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in words_with_offsets]
    encoded_words = [encode_word(word, model)[0] for word in pre_tokenized_text]
    return sum(encoded_words, [])

print(tokenize("This's me .", model))
# ['<unk>', '▁', 'me', '▁', '▁', '.']
```
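The same comparison for Unigram, assuming `tokenizers` is installed and reusing the `corpus` from above. The library starts from a large seed vocabulary and prunes it with an EM-style procedure, which is more elaborate than the simplified loop shown here, so the resulting vocabulary and segmentations can differ; `vocab_size=100` is chosen only to mirror the toy setting:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Unigram model with Metaspace pre-tokenization, as in the example above.
uni_tokenizer = Tokenizer(models.Unigram())
uni_tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

trainer = trainers.UnigramTrainer(
    vocab_size=100,
    special_tokens=["<unk>"],
    unk_token="<unk>",
)
uni_tokenizer.train_from_iterator(corpus, trainer=trainer)

print(uni_tokenizer.encode("This's me .").tokens)
```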