Say, I have three sample sentences:
s0 = "This model was pretrained using a specific normalization pipeline available here!"
s1 = "Thank to all the people around,"
s2 = "Bengali Mask Language Model for Bengali Language"
I could make a batch like:
batch = [[s0, s1], [s1, s2]]
Now, if I apply the BERT tokenizer to the sentence pairs, it truncates each pair so that the combined length of the two sentences fits within the max_length
parameter, which is what is supposed to happen. Here's what I mean:
from transformers import AutoTokenizer, AutoModelForPreTraining

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForPreTraining.from_pretrained("bert-base-uncased")
encoded = tokenizer(batch, padding="max_length", truncation=True, max_length=10)["input_ids"]
decoded = tokenizer.batch_decode(encoded)
print(decoded)
>>>Output: ['[CLS] this model was pre [SEP] thank to all [SEP]', '[CLS] thank to all [SEP] bengali mask language model [SEP]']
My question is: how does the truncation work here for sentence pairs, where the number of tokens from each sentence of a pair is not equal?
For example, in the first output, '[CLS] this model was pre [SEP] thank to all [SEP]',
the number of tokens taken from the two sentences is not equal, i.e. [CLS] 4 tokens [SEP] 3 tokens [SEP].
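For reference, here is a quick check (not part of the snippet above; the exact counts depend on the bert-base-uncased word-piece vocabulary): tokenizing each sentence on its own, without special tokens, shows that s0 produces noticeably more tokens than s1:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# s0, s1, s2 are the three sample sentences defined above
for name, text in [("s0", s0), ("s1", s1), ("s2", s2)]:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    print(name, len(ids), tokenizer.convert_ids_to_tokens(ids))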
There are different truncation strategies you can choose from:
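As a minimal sketch of how the strategies behave on a pair (the strategy names are the standard values accepted by the truncation parameter; a larger max_length than in the question is used here only so that every strategy has room to truncate, and the exact truncated outputs depend on the word-piece counts):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
pair = ("This model was pretrained using a specific normalization pipeline available here!",
        "Thank to all the people around,")

# truncation=True is equivalent to "longest_first": tokens are removed one at a
# time from whichever sentence of the pair is currently longer until the total
# fits max_length, which is why the question's first pair ends up as 4 vs. 3 tokens.
for strategy in ["longest_first", "only_first", "only_second"]:
    ids = tokenizer(*pair, truncation=strategy, max_length=20)["input_ids"]
    print(strategy, "->", tokenizer.decode(ids))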