How does truncation work when applying the BERT tokenizer to a batch of sentence pairs in Hugging Face?

Posted on 2025-01-29 04:10:37


Say, I have three sample sentences:

s0 = "This model was pretrained using a specific normalization pipeline available here!"
s1 = "Thank to all the people around,"
s2 = "Bengali Mask Language Model for Bengali Language"

I could make a batch like:

batch = [[s0, s1], [s1, s2]]

Now, if I apply the BERT tokenizer to the sentence pairs, it truncates each pair whose length exceeds the limit, so that the total length of the pair fits the max_length parameter, which is what it is supposed to do. Here's what I mean:

from transformers import AutoTokenizer, AutoModelForPreTraining

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForPreTraining.from_pretrained("bert-base-uncased")

# Tokenize each sentence pair, padding/truncating to 10 tokens in total.
encoded = tokenizer(batch, padding="max_length", truncation=True, max_length=10)["input_ids"]
decoded = tokenizer.batch_decode(encoded)
print(decoded)

>>>Output: ['[CLS] this model was pre [SEP] thank to all [SEP]', '[CLS] thank to all [SEP] bengali mask language model [SEP]']

My question is: how does truncation work here on the sentence pairs, given that the number of tokens kept from each sentence of a pair is not equal?

For example, in the first output, '[CLS] this model was pre [SEP] thank to all [SEP]', the number of tokens kept from the two sentences is not equal, i.e., [CLS] 4 tokens [SEP] 3 tokens [SEP].
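For reference, a minimal check of the raw, pre-truncation token count of each sentence (using the same bert-base-uncased tokenizer) makes the mismatch easy to see:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

s0 = "This model was pretrained using a specific normalization pipeline available here!"
s1 = "Thank to all the people around,"
s2 = "Bengali Mask Language Model for Bengali Language"

# WordPiece tokens per sentence, before special tokens or truncation are applied.
for s in (s0, s1, s2):
    tokens = tokenizer.tokenize(s)
    print(len(tokens), tokens)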


Comments (1)

风尘浪孓 2025-02-05 04:10:38


There are different truncation strategies you can choose from:

  • True or 'longest_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will truncate token by token, removing a token from the longest sequence in the pair if a pair of sequences (or a batch of pairs) is provided (see the sketch after this list).
  • 'only_first': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the first sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
  • 'only_second': Truncate to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided. This will only truncate the second sequence of a pair if a pair of sequences (or a batch of pairs) is provided.
  • False or 'do_not_truncate' (default): No truncation (i.e., can output batch with sequence lengths greater than the model maximum admissible input size).
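To see why the output in the question splits 4-vs-3, here is a minimal sketch of the 'longest_first' behaviour (an illustration of the strategy under simplifying assumptions, not the library's actual implementation): with max_length=10 and three special tokens ([CLS], [SEP], [SEP]), only 7 content tokens remain, and tokens are dropped one at a time from whichever sequence is currently longer until the pair fits.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

s0 = "This model was pretrained using a specific normalization pipeline available here!"
s1 = "Thank to all the people around,"

# Token ids of each sentence on its own, without [CLS]/[SEP].
ids_a = tokenizer.encode(s0, add_special_tokens=False)
ids_b = tokenizer.encode(s1, add_special_tokens=False)

max_length = 10
budget = max_length - tokenizer.num_special_tokens_to_add(pair=True)  # 10 - 3 = 7

# longest_first: repeatedly drop the last token of the currently longer sequence.
while len(ids_a) + len(ids_b) > budget:
    if len(ids_a) > len(ids_b):
        ids_a = ids_a[:-1]
    else:
        ids_b = ids_b[:-1]

print(tokenizer.decode(ids_a))  # this model was pre   (4 tokens)
print(tokenizer.decode(ids_b))  # thank to all         (3 tokens)

The alternating removal ends with 4 tokens left from the first sentence and 3 from the second, which matches the decoded output in the question; by contrast, 'only_first' or 'only_second' would take all the removed tokens from one side of the pair.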