Hugging Face - Why does the T5 model shorten sentences?

Posted 2025-02-13 08:24:32


I wanted to train a model for spell correction. I trained two models: allegro/plt5-base on Polish sentences and google/t5-v1_1-base on English sentences. Unfortunately, I don't know why, but both models shorten the sentences.
Example:

# tokenizer and model: the fine-tuned T5 checkpoint and its tokenizer, loaded earlier with the transformers library
phrases = ['The name of the man who was kild was Jack Robbinson he has black hair brown eyes blue Jacket and blue Jeans.']
encoded = tokenizer(phrases, return_tensors="pt", padding=True, max_length=512, truncation=True)
print(encoded)
# {'input_ids': tensor([[   37,   564,    13,     8,   388,   113,    47,     3,   157,   173,
#             26,    47,  4496,  5376,  4517,   739,     3,    88,    65,  1001,
#           1268,  4216,  2053,  1692, 24412,    11,  1692,  3966,     7,     5,
#              1]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
#          1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

encoded.to('cuda')
translated = model.generate(**encoded)
print(translated)
# tensor([[   0,   37,  564,   13,    8,  388,  113,   47, 2170,   47, 4496, 5376,
#          4517,  739,    3,   88,   65, 1001, 1268, 4216]], device='cuda:0')

tokenizer.batch_decode(translated, skip_special_tokens=True)
#['The name of the man who was born was Jack Robbinson he has black hair brown']
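Note that the generated tensor above contains exactly 20 token ids. A quick way to see the length defaults that generate() falls back to is to print them from the loaded objects. A minimal sketch, assuming tokenizer and model are the same objects as in the snippet above (model.config.max_length is a standard transformers attribute that defaults to 20; generation_config only exists in newer library versions, hence the guard):

# Length-related settings that can affect the output:
# - tokenizer.model_max_length is the tokenizer's default truncation length for the encoder input
# - model.config.max_length is the default target length generate() uses
#   when no max_length / max_new_tokens argument is passed (it defaults to 20)
print("tokenizer.model_max_length:", tokenizer.model_max_length)
print("model.config.max_length:", model.config.max_length)

# Newer transformers versions keep generation defaults in a separate
# GenerationConfig object; guard the access since older versions (e.g. 4.18) lack it.
gen_cfg = getattr(model, "generation_config", None)
if gen_cfg is not None:
    print("generation_config.max_length:", gen_cfg.max_length)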

Something like this happens with almost every longer sentence. I tried to check whether the model has a maximum sequence length set, based on the documentation: https://huggingface.co/transformers/v3.1.0/model_doc/t5.html. But the config of this model has no such field:
n_positions – The maximum sequence length that this model might ever be used with. Typically set this to something large just in case (e.g., 512 or 1024 or 2048). n_positions can also be accessed via the property max_position_embeddings.
This is the entire config of the model:

T5Config {
  "_name_or_path": "final_model_t5_800_000",
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.18.0",
  "use_cache": true,
  "vocab_size": 32128
}
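The missing n_positions / max_position_embeddings field is not surprising here: T5 encodes positions with relative-attention buckets rather than a learned absolute position table, so the config has no hard input-length field, only the bucket settings visible above. A small sketch of reading them, assuming model is the checkpoint loaded from this config:

# T5 uses relative-attention position buckets, so there is no absolute
# position-embedding table whose size would cap the sequence length.
print(model.config.relative_attention_num_buckets)   # 32 in the config above
print(model.config.relative_attention_max_distance)  # 128 in the config above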

What can be done to make the model return whole sentences?

Update

I looked in the old documentation earlier, but in the new documentation I don't see any field in the config about the maximum sentence length at all.


Comments (1)

宛菡 2025-02-20 08:24:32


I have managed to solve the problem. When generating tokens with the model, the max_length parameter has to be added, as below:

translated = self._model.generate(**encoded, max_length=1024)

As a result, the model was no longer truncating sentences.
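For reference, a few equivalent ways to lift that default, sketched under the assumption that model and encoded are the same objects as in the question; max_length and max_new_tokens are standard generate() arguments, and generation still stops at the EOS token, so a generous limit only acts as an upper bound:

# Option 1: pass an explicit upper bound on the total output length per call
# (this is the fix shown above).
translated = model.generate(**encoded, max_length=1024)

# Option 2: bound the number of newly generated tokens instead of the total
# sequence length (available in recent transformers versions).
translated = model.generate(**encoded, max_new_tokens=512)

# Option 3: raise the model-wide default so every generate() call inherits it.
model.config.max_length = 512
translated = model.generate(**encoded)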
