How to use a custom tokenizer with mT5?
I'd like to preface this by saying that I'm new to ML in general; I'm using mT5 for practice.
I trained an mT5 model for machine translation, but would now like to use a custom tokenizer. I have already built a BPE tokenizer and saved it as per the Hugging Face docs.
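Roughly, this is how I built and saved it (the corpus path, vocabulary size, and output file name are just placeholders for my actual files):
from tokenizers import ByteLevelBPETokenizer
# train a byte-level BPE tokenizer on my own corpus and save it as a single JSON file
bpe_tokenizer = ByteLevelBPETokenizer()
bpe_tokenizer.train(
    files=["my_corpus.txt"],                    # placeholder path
    vocab_size=32000,                           # placeholder size
    special_tokens=["<pad>", "</s>", "<unk>"],  # the specials T5-style models expect
)
bpe_tokenizer.save("my_tokenizer.json")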
I know I can do:
from transformers import MT5Model, T5Tokenizer
model = MT5Model.from_pretrained("google/mt5-small")
tokenizer = T5Tokenizer.from_pretrained("google/mt5-small")
article = "UN Offizier sagt, dass weiter verhandelt werden muss in Syrien."
summary = "Weiter Verhandlung in Syrien."
inputs = tokenizer(article, return_tensors="pt")
with tokenizer.as_target_tokenizer():
    labels = tokenizer(summary, return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"], decoder_input_ids=labels["input_ids"])
hidden_states = outputs.last_hidden_state
as per the docs, and I assume I can just load my tokenizer in place of Google's, but I'm not sure what happens after this: I don't know where the hidden_states go or what to do with them after that snippet.
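For what it's worth, this is how I imagined swapping in my own tokenizer (the file name is just a placeholder for the file I saved above):
from transformers import PreTrainedTokenizerFast
# wrap my saved tokenizers JSON so it can be called like the pretrained tokenizer
my_tokenizer = PreTrainedTokenizerFast(tokenizer_file="my_tokenizer.json")
inputs = my_tokenizer(article, return_tensors="pt")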
This is what I have so far:
from simpletransformers.t5 import T5Model, T5Args
model_args = T5Args()
model_args.max_seq_length = 64
model_args.train_batch_size = 15
model_args.eval_batch_size = 15
model_args.num_train_epochs = 20
model_args.evaluate_during_training = True
model_args.evaluate_during_training_steps = 30000
model_args.use_multiprocessing = False
model_args.fp16 = False
model_args.save_steps = -1
model_args.save_eval_checkpoints = False
model_args.no_cache = True
model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.preprocess_inputs = False
model_args.num_return_sequences = 1
model = T5Model("mt5", "google/mt5-base", args=model_args)
After this I pretty much just call model.train_model(train_df, eval_data=eval_df) to train it.
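If it matters, my DataFrames follow the usual simpletransformers T5 layout, something like this (the rows below are made-up placeholders, not my real data):
import pandas as pd
train_df = pd.DataFrame({
    "prefix": [""],  # mT5 can be trained without a task prefix
    "input_text": ["UN Offizier sagt, dass weiter verhandelt werden muss in Syrien."],
    "target_text": ["UN official says further negotiation is needed in Syria."],
})
eval_df = train_df.copy()  # placeholder; really a separate held-out set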
Where should I plug in the custom tokenizer? I can't figure it out...
The main question:
- I want to add a custom tokenizer to the mT5 model. I understand that, according to the docs, I can just load one (which I have already created and saved). My question is: what do I do after the snippet I posted up top? I am unsure what happens after hidden_states, given that the approach I currently use has no need for any of it. That is, my current approach just uses the T5Args() I mentioned above, and I train with that.
Comments (1)
There are several points to note here!
First of all, keep in mind that every transformer model has gone through a long pre-training phase on a massive corpus. This phase gives the model a general idea of any word in any context. In most cases, this phase cannot be done properly and completely by individual researchers.
Every word recognized by the tokenizer has a corresponding entry in the embedding layer (the very first layer of the model). If a new tokenizer assigns a different token_id to a word, all the knowledge the model has gained about that word is lost. The only reasonable way to manipulate the tokenizer in the transformers architecture is to add new words (if they are not already included) and then fine-tune on a customized corpus for the task at hand.
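A rough sketch of that idea with the plain transformers API (the added token strings are just placeholders; if you stay with simpletransformers, the same two calls have to be applied to the model and tokenizer objects it wraps):
from transformers import MT5ForConditionalGeneration, T5Tokenizer
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
tokenizer = T5Tokenizer.from_pretrained("google/mt5-base")
# add only the words that the original vocabulary is missing
num_added = tokenizer.add_tokens(["domainword1", "domainword2"])
# grow the embedding layer so the new ids get (randomly initialised) vectors,
# while every existing embedding, and the knowledge in it, stays untouched
model.resize_token_embeddings(len(tokenizer))
# ... then fine-tune on your own corpus so the new embeddings are actually learned
Resizing only appends rows for the new ids, so the pre-trained weights are preserved; the new vectors only become useful after fine-tuning on your data.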