How do I apply max_length to truncate the token sequence from the left in a HuggingFace tokenizer?
In the HuggingFace tokenizer, the max_length argument specifies the length of the tokenized text. I believe it truncates the sequence to max_length - 2 (if truncation=True) by cutting the excess tokens from the right. For the purposes of utterance classification, I need to cut the excess tokens from the left, i.e. the start of the sequence, in order to preserve the last tokens. How can I do that?
from transformers import AutoTokenizer
train_texts = ['text 1', ...]
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
encodings = tokenizer(train_texts, max_length=128, truncation=True)
3 Answers
Tokenizers have a truncation_side parameter that does exactly this. See the docs.
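For example, the parameter can be passed directly to from_pretrained; a minimal sketch, assuming a transformers version recent enough to accept truncation_side:

from transformers import AutoTokenizer

train_texts = ['text 1', 'text 2']

# truncation_side='left' makes truncation drop tokens from the start of the
# sequence instead of the end (the default is 'right').
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base', truncation_side='left')
encodings = tokenizer(train_texts, max_length=128, truncation=True)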
Late answer: mutating the PreTrainedTokenizer.truncation_side attribute worked for me.
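In code, that mutation looks roughly like this (a minimal sketch; truncation_side is the attribute the answer refers to, and its default is 'right'):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
# Switch truncation to drop tokens from the start of the sequence.
tokenizer.truncation_side = 'left'
encodings = tokenizer(['text 1', 'text 2'], max_length=128, truncation=True)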
I wrote a solution, which is not very robust. Still looking for a better way. This was tested with the models mentioned in the code.
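The answer's code is not reproduced here; the following is only a hypothetical sketch of such a manual approach (the helper name encode_keep_tail is illustrative, not the answerer's): tokenize without special tokens, keep only the last pieces that fit, then re-add the special tokens.

from transformers import AutoTokenizer

def encode_keep_tail(texts, tokenizer, max_length=128):
    # Budget for content tokens once the special tokens (<s> ... </s> for
    # XLM-R-style models) are added back, i.e. max_length - 2 here.
    budget = max_length - tokenizer.num_special_tokens_to_add()
    input_ids, attention_mask = [], []
    for text in texts:
        ids = tokenizer(text, add_special_tokens=False)['input_ids']
        ids = ids[-budget:]  # drop the excess from the left, keep the tail
        ids = tokenizer.build_inputs_with_special_tokens(ids)
        input_ids.append(ids)
        attention_mask.append([1] * len(ids))
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
encodings = encode_keep_tail(['text 1', 'text 2'], tokenizer, max_length=128)

Note that this returns plain Python lists rather than padded tensors, so the truncation_side approaches above are the cleaner option when available.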