I am using the T5 model and tokenizer for a downstream task. I want to add certain whitespace tokens to the tokenizer, such as line ending ("\n") and tab ("\t"). Adding these tokens works, but somehow the tokenizer always ignores the second whitespace: it tokenizes the sequence "\n\n" as a single line ending, the sequence "\n\n\n\n" as two line endings, and so on. See below to reproduce.
from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-large")
tokenizer.add_tokens(["\n"])
tokenizer.encode("\n") # returns [32100, 1] as expected
tokenizer.encode("\n\n") # returns [32100, 1] but expected would be [32100, 32100, 1]
tokenizer.encode("\n\n\n\n") # returns [32100, 32100, 1] but expected would be [32100, 32100, 32100, 32100, 1]
What is the reasoning behind this behaviour? Is it a bug, or is it related to how the tokenizer works? I noticed that this only happens for added whitespace tokens, not for other characters.
Is there a way to prevent the tokenizer from ignoring the repeated whitespaces?
The behaviour is explained by how the tokenize method in T5Tokenizer strips tokens by default. What one can do is add the token '\n' as a special token to the tokenizer. Because special tokens are never split, it works as expected. It is a bit hacky, but it seems to work.
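A minimal sketch of that approach, reconstructed from the description (the AddedToken class comes from the tokenizers package, which transformers depends on):

from transformers import T5Tokenizer
from tokenizers import AddedToken
tokenizer = T5Tokenizer.from_pretrained("t5-large")
# Register "\n" as an additional special token; special tokens bypass
# the normalization/stripping step, so repeated newlines survive encoding.
tokenizer.add_special_tokens({"additional_special_tokens": [AddedToken("\n")]})
tokenizer.encode("\n\n")  # expected: two ids for "\n" followed by the </s> id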
Then it tokenizes '\n' without skipping any occurrences. Note that using AddedToken is important, because somehow the following does NOT work.
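The non-working snippet did not survive in this copy; presumably it is the plain-string form (an assumption, reconstructed from context), which still goes through normalization:

# Assumed reconstruction: passing a bare string instead of an AddedToken.
# The string is still normalized, so consecutive "\n" get collapsed again.
tokenizer.add_special_tokens({"additional_special_tokens": ["\n"]})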
After spending more time on it, I actually found a way to add it as a normal token without using special tokens. The main reason for the issue is the normalization process that happens behind the scenes even before the tokenization. When you add a new token, you can specify if it should be normalized or not. By setting normalize to False, you avoid the tokenizer from stripping consecutive occurrences of the added token.
You can find more information at this link: https://huggingface.co/course/en/chapter6/4?fw=pt