How to use a custom tokenizer with mT5?
I'd like to preface this by saying that I'm new to ML in general; I'm using mT5 for practice.
I trained an mT5 model for machine translation, but would now like to use a custom tokenizer. I have already built a BPE tokenizer and saved it as per the Hugging Face docs.
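Roughly, this is how I built and saved it (the corpus path, vocabulary size, and output file name are just placeholders for my actual files):
from tokenizers import ByteLevelBPETokenizer
# train a byte-level BPE tokenizer on my own corpus and save it as a single JSON file
bpe_tokenizer = ByteLevelBPETokenizer()
bpe_tokenizer.train(
    files=["my_corpus.txt"],                    # placeholder path
    vocab_size=32000,                           # placeholder size
    special_tokens=["<pad>", "</s>", "<unk>"],  # the specials T5-style models expect
)
bpe_tokenizer.save("my_tokenizer.json")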
I know I can do:
from transformers import MT5Model, T5Tokenizer
model = MT5Model.from_pretrained("google/mt5-small")
tokenizer = T5Tokenizer.from_pretrained("google/mt5-small")
article = "UN Offizier sagt, dass weiter verhandelt werden muss in Syrien."
summary = "Weiter Verhandlung in Syrien."
inputs = tokenizer(article, return_tensors="pt")
with tokenizer.as_target_tokenizer():
    labels = tokenizer(summary, return_tensors="pt")
outputs = model(input_ids=inputs["input_ids"], decoder_input_ids=labels["input_ids"])
hidden_states = outputs.last_hidden_state
as per the docs, and I assume I can just load my tokenizer in place of Google's, but I'm not sure what happens after this: I don't know where the hidden_states go or what to do with them after that snippet.
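For what it's worth, this is how I imagined swapping in my own tokenizer (the file name is just a placeholder for the file I saved above):
from transformers import PreTrainedTokenizerFast
# wrap my saved tokenizers JSON so it can be called like the pretrained tokenizer
my_tokenizer = PreTrainedTokenizerFast(tokenizer_file="my_tokenizer.json")
inputs = my_tokenizer(article, return_tensors="pt")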
This is what I have so far:
from simpletransformers.t5 import T5Model, T5Args
model_args = T5Args()
model_args.max_seq_length = 64
model_args.train_batch_size = 15
model_args.eval_batch_size = 15
model_args.num_train_epochs = 20
model_args.evaluate_during_training = True
model_args.evaluate_during_training_steps = 30000
model_args.use_multiprocessing = False
model_args.fp16 = False
model_args.save_steps = -1
model_args.save_eval_checkpoints = False
model_args.no_cache = True
model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.preprocess_inputs = False
model_args.num_return_sequences = 1
model = T5Model("mt5", "google/mt5-base", args=model_args)
After this I pretty much just call model.train_model(train_df, eval_data=eval_df) to train it.
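If it matters, my DataFrames follow the usual simpletransformers T5 layout, something like this (the rows below are made-up placeholders, not my real data):
import pandas as pd
train_df = pd.DataFrame({
    "prefix": [""],  # mT5 can be trained without a task prefix
    "input_text": ["UN Offizier sagt, dass weiter verhandelt werden muss in Syrien."],
    "target_text": ["UN official says further negotiation is needed in Syria."],
})
eval_df = train_df.copy()  # placeholder; really a separate held-out set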
Where should I plug in the custom tokenizer? I can't figure it out...
The main question:
- I want to add a custom tokenizer to the mT5 model. I understand that, according to the docs, I can just load one (which I have already created and saved). My question is: what do I do after the snippet I posted up top? I am unsure what happens after hidden_states, given that the approach I currently use has no need for any of it. That is, my current approach just uses the T5Args() I mentioned above, and I train with that.
Comments (1)
There are several points to note here!
First of all, keep in mind that every transformer model has gone through a long pre-training phase on a massive corpus. This phase gives the model a general idea of any word in any context. In most cases, this phase cannot be done properly and completely by individual researchers.
Every word recognized by the tokenizer has a corresponding entry in the embedding layer (the very first layer of the model). If a new tokenizer assigns a different token_id to a word, all the knowledge the model has gained about that word is lost. The only reasonable way to manipulate the tokenizer in the transformers architecture is to add new words (if they are not already included) and then fine-tune on a customized corpus for the task at hand.
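A rough sketch of that idea with the plain transformers API (the added token strings are just placeholders; if you stay with simpletransformers, the same two calls have to be applied to the model and tokenizer objects it wraps):
from transformers import MT5ForConditionalGeneration, T5Tokenizer
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
tokenizer = T5Tokenizer.from_pretrained("google/mt5-base")
# add only the words that the original vocabulary is missing
num_added = tokenizer.add_tokens(["domainword1", "domainword2"])
# grow the embedding layer so the new ids get (randomly initialised) vectors,
# while every existing embedding, and the knowledge in it, stays untouched
model.resize_token_embeddings(len(tokenizer))
# ... then fine-tune on your own corpus so the new embeddings are actually learned
Resizing only appends rows for the new ids, so the pre-trained weights are preserved; the new vectors only become useful after fine-tuning on your data.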