Huggingface Transformers Bert Tokenizer - find out which documents get truncated
I am using the Transformers library from Hugging Face to create a text classification model based on BERT. For this I tokenise my documents, and since my documents are longer than the allowed maximum input length (512), I set truncation to true.
How can I find out how many documents are actually getting truncated? I don't think the length (512) is the character or word count of the document, as the tokenizer prepares the document as input for the model. What happens to the document, and is there a straightforward way to check whether or not it gets truncated?
This is the code I use to tokenise the documents.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")
# AutoModelForSequenceClassification resolves to the class matching the checkpoint (DistilBERT here).
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-multilingual-cased", num_labels=7)
train_encoded = tokenizer(X_train, padding=True, truncation=True, return_tensors="pt")
In case you have any more questions about my code or problem, feel free to ask.
Your assumption is correct! Anything with a length larger than 512 tokens (assuming you are using "distilbert-base-multilingual-cased") is truncated by having truncation=True. A quick solution would be to tokenise without truncation and count the examples that are longer than the model's maximum input length:
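A rough sketch of that (X_train is the same list of raw document strings you already pass to the tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

# Tokenise without truncation or padding to get the true token count per
# document (no return_tensors here, since the lengths differ between docs).
unclipped = tokenizer(X_train, truncation=False, padding=False)

max_len = tokenizer.model_max_length  # 512 for this checkpoint
lengths = [len(ids) for ids in unclipped["input_ids"]]
n_truncated = sum(length > max_len for length in lengths)
print(f"{n_truncated} of {len(lengths)} documents get truncated")

The counts include the special tokens the tokenizer adds ([CLS] and [SEP]), so they match exactly what the model would receive. Tokenising past the limit only prints a warning; the overlong sequences are harmless as long as you don't feed them to the model.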