Do you need to put EOS and BOS tokens in an autoencoder transformer?

Posted on 2025-01-10 15:38:20


I'm starting to wrap my head around the transformer architecture, but there are some things that I am not yet able to grasp.

In decoder-free transformers, such as BERT, the tokenizer always includes the tokens CLS and SEP before and after a sentence. I understand that CLS acts both as a BOS and as a single hidden output that gives the classification information, but I am a bit lost about why it needs SEP for the masked language modeling part.

I'll explain a bit more about the utility I expect to get. In my case, I want to train a transformer to act as an autoencoder, so target = input. There would be no decoder, since my idea is to reduce the dimensionality of the original vocabulary into fewer embedding dimensions, and then study (not sure how yet, but will get there) the reduced space in order to extract useful information.

Therefore, an example would be:

string_input = "The cat is black" 
tokens_input =  [1,2,3,4]

string_target = "The cat is black"
tokens_output = [1,2,3,4]
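
For concreteness, the kind of model I have in mind is something like this rough sketch (assuming PyTorch; the layer sizes are arbitrary and positional encodings are omitted for brevity):

import torch
import torch.nn as nn

class TransformerAutoencoder(nn.Module):
    """Encoder-only 'autoencoder': the target is the input token sequence."""
    def __init__(self, vocab_size, d_model=32, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)            # vocab -> reduced dimension
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.to_vocab = nn.Linear(d_model, vocab_size)             # project back to the vocabulary

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))   # the reduced space I want to study
        return self.to_vocab(h)                   # per-position logits over the vocabulary

# "The cat is black" -> [1, 2, 3, 4]; target = input, no BOS/EOS so far
model = TransformerAutoencoder(vocab_size=10)
tokens = torch.tensor([[1, 2, 3, 4]])
logits = model(tokens)
loss = nn.functional.cross_entropy(logits.view(-1, 10), tokens.view(-1))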

Now when tokenizing, assuming that we tokenize on a word-by-word basis, what would be the advantage of adding BOS and EOS?

I think these are only useful when you are using the self-attention decoder, right? So, since in that case the outputs have to enter the decoder right-shifted, the vectors would be:

input_string = "The cat is black EOS"
input_tokens = [1,2,3,4,5]

shifted_output_string = "BOS The cat is black"
shifted_output_tokens = [6,1,2,3,4]

output_string = "The cat is black EOS"
output_tokens = [1,2,3,4,5]

However, BERT does not have a self-attention decoder, but a simple feedforward layer. That is why I'm not sure of understanding the purpose of these special tokens.

In summary, the questions would be:

  • Do you always need BOS and EOS tokens, even if you don't have a transformer decoder?
  • Why does BERT, which does not have a transformer decoder, require the SEP token for the masked language model part?


葬﹪忆之殇 2025-01-17 15:38:20


First, a little about BERT -
BERT word embeddings allow for multiple vector representations of the same word, based on the context in which the word is used. In this sense, BERT embeddings are context-dependent. BERT explicitly takes the position of each word in the sentence into account when calculating its embedding. The input to BERT is a sentence rather than a single word, because BERT needs the context of the whole sentence to determine the vectors of the words in that sentence. Feeding BERT a single word vector would completely defeat the purpose of its bidirectional, contextual nature. The output is then a fixed-length vector representation of the whole input sentence. BERT provides support for out-of-vocabulary words because the model learns words at a "subword" level (also called "word pieces").
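
As a small illustration of the word-piece behaviour (the example word is arbitrary, and the exact split depends on the vocabulary):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# words outside the vocabulary are split into known word pieces
print(tokenizer.tokenize("The cat is blackish"))
# e.g. ['the', 'cat', 'is', 'black', '##ish']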

The SEP token is used to help BERT differentiate between two different word sequences. This is necessary for next sentence prediction (NSP). CLS is also used in NSP: its final hidden state is the representation the classifier reads to decide whether the second sequence follows the first. Ideally you would use a format like this:

[CLS] sequence 1 [SEP] sequence 2 [SEP]

Note that we are not using any BOS or EOS tokens. The standard BERT tokenizer does not include these. We can see this if we run the following code:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# BERT defines [SEP] and [CLS], but no dedicated BOS/EOS tokens
print(tokenizer.eos_token)
print(tokenizer.bos_token)
print(tokenizer.sep_token)
print(tokenizer.cls_token)

Output:
None
None
[SEP]
[CLS]
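
To see where [CLS] and [SEP] actually end up, you can also encode a sentence pair (the sentences here are just made-up examples):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# encode two sequences together, as in the NSP-style format above
encoding = tokenizer("The cat is black", "It likes to sleep")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# roughly: ['[CLS]', 'the', 'cat', 'is', 'black', '[SEP]', 'it', 'likes', 'to', 'sleep', '[SEP]']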

For masked language modeling (MLM), we are only concerned with the [MASK] token, since the model's objective is merely to guess the masked token.
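
A quick way to see the MLM objective in action is the fill-mask pipeline (this downloads the pretrained model on first use; the sentence is just an example):

from transformers import pipeline

# the MLM head is only asked to guess the token hidden behind [MASK]
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for prediction in unmasker("The cat is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))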

BERT was trained on both NSP and MLM, and it is the combination of those two training methods that makes BERT so effective.

So to answer your questions - you do not "always need" EOS and/or BOS. In fact, you don't "need" them at all. However, if you are fine-tuning BERT for a specific downstream task where you intend to use BOS and EOS tokens (the manner of which is up to you), then yes, I suppose you would include them as special tokens. But understand that BERT was not trained with those in mind, and you may see unpredictable/unstable results.
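
If you do go that route, one possible sketch with the transformers API (the token strings '[BOS]'/'[EOS]' are arbitrary choices, and their embeddings start out untrained):

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# register new special tokens and grow the embedding matrix to match
tokenizer.add_special_tokens({'bos_token': '[BOS]', 'eos_token': '[EOS]'})
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.bos_token, tokenizer.eos_token)  # [BOS] [EOS]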
