The tokenizer and model object of a Hugging Face pretrained model have different maximum input lengths
I'm using the Hugging Face pretrained model symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli. My task requires running it on fairly large texts, so knowing the maximum input length is essential.
The following code should load the pretrained model and its tokenizer:
from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer

encoding_model_name = "symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli"
encoding_tokenizer = AutoTokenizer.from_pretrained(encoding_model_name)
encoding_model = SentenceTransformer(encoding_model_name)
So when I print information about them:
encoding_tokenizer
encoding_model
I get:
PreTrainedTokenizerFast(name_or_path='symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli', vocab_size=250002, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)})
SentenceTransformer(
(0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
As you can see, the model_max_len=512 parameter in the tokenizer conflicts with the max_seq_length=128 parameter in the model.
How can I determine which one is correct? Or, if they actually describe different things, how can I check the model's maximum input length?
3 Answers
Since you are using a SentenceTransformer and load it with the SentenceTransformer class, it will truncate your input at 128 tokens, as stated by the documentation (the relevant code is here):
You can also check this by yourself:
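(The snippet from the original answer is not reproduced in this copy; a minimal sketch, reusing the encoding_model object loaded in the question, would be:)

# max_seq_length is the limit the SentenceTransformer wrapper applies when encoding
print(encoding_model.max_seq_length)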
Output:
128
The underlying model (xlm-roberta-base) is able to handle sequences with up to 512 tokens, but I assume Symanto limited it to 128 because they also used this limit during training (i.e. the embeddings might not be good for sequences longer than 128 tokens).
model_max_length is the maximum number of positions the model's positional embeddings can handle. To check this, do
print(model.config)
you'll see
"max_position_embeddings": 512
along with other config values. You can pass max_length (up to what your model can take) when you encode text sequences:
tokenizer.encode(txt, max_length=512, truncation=True)
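For the model in the question, both numbers can also be read programmatically. A sketch, assuming the encoding_model and encoding_tokenizer objects from the question (a SentenceTransformer exposes the underlying Hugging Face model via its first module's auto_model attribute):

# the first module of a SentenceTransformer wraps the underlying HF model
hf_config = encoding_model[0].auto_model.config
print(hf_config.max_position_embeddings)    # positional-embedding limit of the base model
print(encoding_tokenizer.model_max_length)  # 512, the limit reported by the tokenizer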
An extract from the Sentence Transformers documentation:
Link - https://www.sbert.net/examples/applications/computing-embeddings/README.html#input-sequence-length
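The extract itself did not survive this copy; the linked section explains that the wrapper's limit is exposed through the max_seq_length property, which can be read and, if needed, changed. A rough sketch along those lines (not a verbatim quote from the docs), again reusing encoding_model from the question:

print("Max Sequence Length:", encoding_model.max_seq_length)  # 128 for this checkpoint

# raise the limit; it cannot exceed what the underlying transformer supports, and the
# model was presumably trained with 128, so longer inputs may embed less reliably
encoding_model.max_seq_length = 512
print("Max Sequence Length:", encoding_model.max_seq_length)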