Hugging Face pretrained model: tokenizer and model objects have different maximum input lengths

Posted 2025-01-18 03:23:39


I'm using the symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli pretrained model from Hugging Face. My task requires using it on pretty large texts, so it's essential to know the maximum input length.

The following code is supposed to load the pretrained model and its tokenizer:

from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer

encoding_model_name = "symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli"
encoding_tokenizer = AutoTokenizer.from_pretrained(encoding_model_name)
encoding_model = SentenceTransformer(encoding_model_name)

So, when I print info about them:

encoding_tokenizer
encoding_model

I'm getting:

PreTrainedTokenizerFast(name_or_path='symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli', vocab_size=250002, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False)})
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

As you can see, the model_max_len=512 parameter in the tokenizer doesn't match the max_seq_length=128 parameter in the model.

How can I figure out which one is correct? Or, if they refer to different things, how can I check the maximum input length for my model?
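
For reference, both reported limits can also be read programmatically; a minimal sketch using the objects loaded above (attribute names taken from the printouts, so they may differ slightly across library versions):

# Tokenizer-side and model-side limits as reported by the loaded objects
print(encoding_tokenizer.model_max_length)  # 512 in the printout above
print(encoding_model.max_seq_length)        # 128 in the printout above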


故事还在继续 2025-01-25 03:23:40


Since you are using a SentenceTransformer and loading it with the SentenceTransformer class, it will truncate your input at 128 tokens, as stated by the documentation (the relevant code is here):

property max_seq_length
Property to get the maximal input sequence length for the model. Longer inputs will be truncated.

You can also check this by yourself:

import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli")

fifty = model.encode(["This " * 50], convert_to_tensor=True)
two_hundred = model.encode(["This " * 200], convert_to_tensor=True)
four_hundred = model.encode(["This " * 400], convert_to_tensor=True)

# Truncation at 128 tokens makes the 200- and 400-word embeddings identical
print(torch.allclose(fifty, two_hundred))
print(torch.allclose(two_hundred, four_hundred))

Output:

False
True

The underlying model (xlm-roberta-base) is able to handle sequences of up to 512 tokens, but I assume Symanto limited it to 128 because they also used this limit during training (i.e. the embeddings might not be good for sequences longer than 128 tokens).
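
If longer inputs are really needed, the limit can in principle be raised toward the base model's 512-token ceiling; a hedged sketch, assuming the writable max_seq_length attribute that Sentence Transformers documents:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli")
print(model.max_seq_length)  # 128 by default for this checkpoint

# Raise the limit toward the base model's 512-token ceiling; embeddings for
# inputs longer than 128 tokens may be lower quality, since the checkpoint
# was apparently trained with the 128-token limit
model.max_seq_length = 512
print(model.max_seq_length)  # 512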

痴情换悲伤 2025-01-25 03:23:40


model_max_length is the maximum length of positional embeddings the model can take. To check this, run
print(model.config)
and you'll see "max_position_embeddings": 512 along with the other config values.
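
If the model was loaded through the SentenceTransformer class (as in the question), the same value can also be read straight from the Hub config with transformers' AutoConfig; a minimal sketch of that check:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli")
# positional-embedding limit of the underlying XLM-RoBERTa encoder
print(config.max_position_embeddings)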

how can I check the maximum input length for my model?

You can pass max_length (up to as much as your model can take) when you're encoding the text sequences:
tokenizer.encode(txt, max_length=512, truncation=True)
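
As a quick sanity check, a hedged sketch (reusing the encoding_tokenizer loaded in the question) that encodes an over-long string and confirms the token count is capped at max_length:

# Encode a deliberately over-long input; truncation should cap it at 512 ids
ids = encoding_tokenizer.encode("This " * 1000, max_length=512, truncation=True)
print(len(ids))  # expected: 512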

淡笑忘祈一世凡恋 2025-01-25 03:23:40


An extract from the Sentence Transformers documentation:

# Input Sequence Length
# Transformer models like BERT / RoBERTa / DistilBERT etc. the runtime and the memory requirement grows quadratic with the input length. 
# This limits transformers to inputs of certain lengths. A common value for BERT & Co. are 512 word pieces, which corresponds to about 300-400 words (for English). 
# Longer texts than this are truncated to the first x word pieces.

# By default, the provided methods use a limit of 128 word pieces, longer inputs will be truncated. 
# You can get and set the maximal sequence length like this:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

print("Max Sequence Length:", model.max_seq_length)

# Change the length to 200
model.max_seq_length = 200

print("Max Sequence Length:", model.max_seq_length)

#Note: You cannot increase the length higher than what is maximally supported by the respective transformer model. 
# Also note that if a model was trained on short texts, the representations for long texts might not be that good.

Link - https://www.sbert.net/examples/applications/computing-embeddings/README.html#input-sequence-length
