What is so special about special tokens?
What exactly is the difference between a "token" and a "special token"?
I understand the following:
- what a typical token is,
- what a typical special token is: MASK, UNK, SEP, etc.,
- when you add a token (when you want to expand your vocab).
What I don't understand is: under what circumstances would you want to create a new special token? What are examples where we need one, beyond the default special tokens? And if an example uses a special token, why can't a normal token achieve the same objective? For instance:
tokenizer.add_tokens(['[EOT]'], special_tokens=True)
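To make it concrete, here is a minimal sketch of what that call changes at the tokenizer level (assuming bert-base-uncased; [EOT] is just an example name):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # Register [EOT] as a special token: it is protected from the
    # tokenization pipeline (never lowercased, never split into sub-words).
    tokenizer.add_tokens(["[EOT]"], special_tokens=True)

    print(tokenizer.tokenize("Hello [EOT] [ABC]"))
    # [EOT] survives as a single token, while the unregistered [ABC]
    # is lowercased and split into pieces.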
And I also don't quite understand the following description in the source documentation. What difference does it make to our model if we set add_special_tokens to False?
add_special_tokens (bool, optional, defaults to True) — Whether or not to encode the sequences with the special tokens relative to their model.
Comments (1)
Special tokens are called special because they are not derived from your input. They are added for a certain purpose and are independent of the specific input.
Just as an example: in extractive conversational question answering, it is not unusual to add the question and answer of the previous dialog turn to your input to provide some context for your model. Those previous dialog turns are separated from the current question with special tokens. Sometimes people use the model's separator token; sometimes they introduce a new special token such as [Q], as in the sketch below.
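A minimal sketch of such a setup (assuming bert-base-uncased; the dialog strings are made up):

    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    # Introduce [Q] to mark the question of the previous dialog turn.
    tokenizer.add_tokens(["[Q]"], special_tokens=True)
    # The embedding matrix must grow to cover the new vocabulary entry.
    model.resize_token_embeddings(len(tokenizer))

    history = "[Q] Who wrote Hamlet? It was written by Shakespeare."
    current_question = "When did he write it?"

    inputs = tokenizer(history, current_question, return_tensors="pt")
    print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
    # [Q] is kept intact as one token and cleanly separates the previous
    # turn from the current question.

Note that the new embedding row is randomly initialized, so the model has to be fine-tuned before [Q] carries any meaning for it.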
Regarding your second question: if you encode with add_special_tokens=False, the input misses two tokens (the special tokens; [CLS] and [SEP] in BERT's case), as the sketch after this paragraph shows. Those special tokens have a meaning for your model, since it was trained with them. The last_hidden_state will be different due to the lack of those two tokens and will therefore lead to a different result for your downstream task.
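A minimal sketch of that difference (again assuming bert-base-uncased):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # add_special_tokens=True is the default
    with_special = tokenizer.encode("hello world")
    without_special = tokenizer.encode("hello world", add_special_tokens=False)

    print(tokenizer.convert_ids_to_tokens(with_special))
    # -> ['[CLS]', 'hello', 'world', '[SEP]']
    print(tokenizer.convert_ids_to_tokens(without_special))
    # -> ['hello', 'world']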
Some tasks, like sequence classification, often use the [CLS] token to make their predictions. When you remove it, a model that was pre-trained with a [CLS] token will struggle.
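To see where [CLS] enters the picture, a minimal sketch (again assuming bert-base-uncased):

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("a great movie", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # A sequence-classification head is typically a linear layer on top of
    # the hidden state at position 0, which is the [CLS] token. Encode with
    # add_special_tokens=False and that position holds an ordinary word
    # instead, so the pre-trained head's input no longer looks like what
    # it saw during training.
    cls_vector = outputs.last_hidden_state[:, 0]  # shape: (1, 768)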