Adding new vocabulary tokens to the model and saving it for a downstream model
Is the mean initialization of the new tokens correct? Also, how should I save the new tokenizer (after adding the new tokens to it) so that it can be used with a downstream model?
I train an MLM model by adding new tokens and initializing them with the mean of their sub-token embeddings. How should I then use the fine-tuned MLM model for a new classification task?
import torch
import transformers as tr

# joined_keywords: list of new keyword strings to add to the vocabulary (defined elsewhere)

# keep an untouched copy of the original tokenizer so each keyword can still be
# looked up as its original sub-token sequence
tokenizer_org = tr.BertTokenizer.from_pretrained("/home/pc/bert_base_multilingual_uncased")

# working tokenizer that receives the new whole-word tokens
tokenizer = tr.BertTokenizer.from_pretrained("/home/pc/bert_base_multilingual_uncased")
tokenizer.add_tokens(joined_keywords)

model = tr.BertForMaskedLM.from_pretrained("/home/pc/bert_base_multilingual_uncased", return_dict=True)

# prepare input
text = ["Replace me by any text you'd like"]
encoded_input = tokenizer(text, truncation=True, padding=True, max_length=512, return_tensors="pt")
print(encoded_input)

# add embedding rows for the new vocab words
model.resize_token_embeddings(len(tokenizer))
weights = model.bert.embeddings.word_embeddings.weight

# initialize each new embedding as the mean of the embeddings of the sub-tokens
# the keyword maps to under the original tokenizer
with torch.no_grad():
    emb = []
    for word in joined_keywords:
        # first & last ids are the [CLS]/[SEP] special tokens; drop them
        tok_ids = tokenizer_org(word)["input_ids"][1:-1]
        tok_weights = weights[tok_ids]
        # average over the sub-token embeddings of the original tokenization
        weight_mean = torch.mean(tok_weights, dim=0)
        emb.append(weight_mean)
    weights[-len(joined_keywords):, :] = torch.vstack(emb).requires_grad_()

model.to(device)  # device defined elsewhere, e.g. torch.device("cuda")

# after MLM fine-tuning with a Trainer (see the sketch below)
trainer.save_model("/home/pc/Bert_multilingual_exp_TCM/model_mlm_exp1")
It saves the model, the config, and the training_args. How do I save the new tokenizer as well?
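The trainer in the last line of the snippet is not defined there. A minimal sketch of the kind of MLM fine-tuning setup it presumably comes from; the dataset name train_dataset and the hyperparameter values are illustrative placeholders, not the asker's actual configuration:

from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# mask 15% of tokens for the masked-language-modelling objective
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="/home/pc/Bert_multilingual_exp_TCM/model_mlm_exp1",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,  # hypothetical tokenized training dataset
)

trainer.train()

trainer.save_model(...) then writes the model weights, config, and training_args to the output directory, which is exactly what the question observes.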
1 Answer
What you are doing is a convenient way to add new tokens and their information to the original text. huggingface provides several ways to do this; I used what is, in my opinion, the simplest one.
The big caveat: when you manipulate the tokenizer, you need to update the model's embedding layer accordingly, with something like model.resize_token_embeddings(len(tokenizer)).
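The caveat above concerns the embedding layer; the tokenizer itself is saved separately from the Trainer checkpoint. A minimal sketch, assuming the same output directory as in the question and the standard Hugging Face save_pretrained API:

# write vocab, added tokens, and tokenizer config next to the fine-tuned MLM model
tokenizer.save_pretrained("/home/pc/Bert_multilingual_exp_TCM/model_mlm_exp1")

Loading that directory with BertTokenizer.from_pretrained afterwards restores the tokenizer, including the added tokens.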
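For the remaining part of the question, reusing the fine-tuned MLM checkpoint for classification, one possible approach (a sketch, not part of the original answer) is to reload the saved directory into a sequence-classification model; the fine-tuned encoder and the newly initialized embedding rows are reused, while the classification head starts from random weights:

import transformers as tr

ckpt = "/home/pc/Bert_multilingual_exp_TCM/model_mlm_exp1"

# the saved tokenizer already contains the added tokens, so no resizing is needed
tokenizer = tr.BertTokenizer.from_pretrained(ckpt)

# loads the fine-tuned BERT encoder and adds a fresh classification head on top
# (num_labels depends on the downstream task; 2 is just an example)
clf_model = tr.BertForSequenceClassification.from_pretrained(ckpt, num_labels=2)

The classification model can then be fine-tuned on the labelled data with another Trainer in the usual way.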