Huggingface Transformers Bert Tokenizer - find out which documents get truncated
I am using the Transformers library from Hugging Face to create a text classification model based on BERT. For this I tokenise my documents, and since my documents are longer than the allowed maximum input length (512), I set truncation to true.
How can I find out how many documents are actually getting truncated? I don't think the length (512) is the character or word count of the document, as the tokenizer prepares the document as input for the model. What happens to the document, and is there a straightforward way to check whether or not it gets truncated?
This is the code I use to tokenise the documents.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
# AutoModelForSequenceClassification resolves to the class matching the checkpoint (DistilBERT here).
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-multilingual-cased", num_labels=7)
train_encoded = tokenizer(X_train, padding=True, truncation=True, return_tensors="pt")
In case you have any more questions about my code or problem, feel free to ask.
Your assumption is correct! Anything with a length larger than 512 tokens (assuming you are using "distilbert-base-multilingual-cased") is truncated by having truncation=True. A quick solution would be to tokenise without truncation and count the examples that are longer than the model's maximum input length:
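A rough sketch of that (X_train is the same list of raw document strings you already pass to the tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

# Tokenise without truncation or padding to get the true token count per
# document (no return_tensors here, since the lengths differ between docs).
unclipped = tokenizer(X_train, truncation=False, padding=False)

max_len = tokenizer.model_max_length  # 512 for this checkpoint
lengths = [len(ids) for ids in unclipped["input_ids"]]
n_truncated = sum(length > max_len for length in lengths)
print(f"{n_truncated} of {len(lengths)} documents get truncated")

The counts include the special tokens the tokenizer adds ([CLS] and [SEP]), so they match exactly what the model would receive. Tokenising past the limit only prints a warning; the overlong sequences are harmless as long as you don't feed them to the model.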