Hugging Face - Why does the T5 model shorten sentences?
I wanted to train a model for spelling correction. I trained two models: allegro/plt5-base on Polish sentences and google/t5-v1_1-base on English sentences. Unfortunately, I don't know why, but both models shorten the sentences.
Example:
phrases = ['The name of the man who was kild was Jack Robbinson he has black hair brown eyes blue Jacket and blue Jeans.']
encoded = tokenizer(phrases, return_tensors="pt", padding=True, max_length=512, truncation=True)
print(encoded)
# {'input_ids': tensor([[ 37, 564, 13, 8, 388, 113, 47, 3, 157, 173,
# 26, 47, 4496, 5376, 4517, 739, 3, 88, 65, 1001,
# 1268, 4216, 2053, 1692, 24412, 11, 1692, 3966, 7, 5,
# 1]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
# 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
encoded.to('cuda')
translated = model.generate(**encoded)
print(translated)
# tensor([[ 0, 37, 564, 13, 8, 388, 113, 47, 2170, 47, 4496, 5376,
# 4517, 739, 3, 88, 65, 1001, 1268, 4216]], device='cuda:0')
tokenizer.batch_decode(translated, skip_special_tokens=True)
#['The name of the man who was born was Jack Robbinson he has black hair brown']
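To make the truncation concrete, here is a quick check of the tensor shapes (the counts match the tensors printed above):
print(encoded["input_ids"].shape)  # torch.Size([1, 31]) - 31 input tokens
print(translated.shape)            # torch.Size([1, 20]) - only 20 generated tokens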
And something like this happens with almost every longer sentence. I tried to check, based on the documentation (https://huggingface.co/transformers/v3.1.0/model_doc/t5.html), whether the model has any maximum sentence length set. But the config of this model has no such field: n_positions – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). n_positions can also be accessed via the property max_position_embeddings.
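As a sanity check, these attributes can also be inspected on the loaded config directly. A minimal sketch, using the google/t5-v1_1-base checkpoint name from above as an example (a fine-tuned checkpoint path would work the same way); on newer transformers versions the attributes may simply be absent:
from transformers import T5ForConditionalGeneration

# T5 uses relative position embeddings, so the config may not define a fixed length limit.
model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-base")
print(getattr(model.config, "n_positions", None))
print(getattr(model.config, "max_position_embeddings", None))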
This is the entire config of the model:
T5Config {
"_name_or_path": "final_model_t5_800_000",
"architectures": [
"T5ForConditionalGeneration"
],
"d_ff": 2048,
"d_kv": 64,
"d_model": 768,
"decoder_start_token_id": 0,
"dropout_rate": 0.1,
"eos_token_id": 1,
"feed_forward_proj": "gated-gelu",
"initializer_factor": 1.0,
"is_encoder_decoder": true,
"layer_norm_epsilon": 1e-06,
"model_type": "t5",
"num_decoder_layers": 12,
"num_heads": 12,
"num_layers": 12,
"output_past": true,
"pad_token_id": 0,
"relative_attention_max_distance": 128,
"relative_attention_num_buckets": 32,
"tie_word_embeddings": false,
"torch_dtype": "float32",
"transformers_version": "4.18.0",
"use_cache": true,
"vocab_size": 32128
}
What can be done to make the model return whole sentences?
Update
I had been looking at the old documentation earlier. In the new documentation I don't see any field in the config about the maximum sentence length at all.
Answer
I have already managed to solve the problem: when generating tokens with the model, the max_length parameter has to be added, as below.
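A minimal sketch of the adjusted call; the exact max_length value used is not shown here, so 512 (matching the tokenizer call above) is an assumed example:
# Pass max_length explicitly; otherwise generate() falls back to the config
# default max_length (20 tokens), which truncates longer outputs.
translated = model.generate(**encoded, max_length=512)
print(tokenizer.batch_decode(translated, skip_special_tokens=True))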
As a result, the model was no longer truncating sentences.
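Note: recent transformers versions also accept max_new_tokens in generate(), which bounds only the newly generated tokens rather than the total sequence length, and can be a clearer choice when input lengths vary.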