Finetuning LayoutLM on a FUNSD-like dataset - IndexError: index out of range in self
I'm experimenting with Hugging Face transformers to finetune microsoft/layoutlmv2-base-uncased through AutoModelForTokenClassification on my custom dataset, which is similar to FUNSD (pre-processed and normalized). After a few iterations of training I get this error:
Traceback (most recent call last):
File "layoutlmV2/train.py", line 137, in <module>
trainer.train()
File "..../lib/python3.8/site-packages/transformers/trainer.py", line 1409, in train
return inner_training_loop(
File "..../lib/python3.8/site-packages/transformers/trainer.py", line 1651, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "..../lib/python3.8/site-packages/transformers/trainer.py", line 2345, in training_step
loss = self.compute_loss(model, inputs)
File "..../lib/python3.8/site-packages/transformers/trainer.py", line 2377, in compute_loss
outputs = model(**inputs)
File "..../lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
return forward_call(*input, **kwargs)
File "..../lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 1228, in forward
outputs = self.layoutlmv2(
File "..../lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
return forward_call(*input, **kwargs)
File "..../lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 902, in forward
text_layout_emb = self._calc_text_embeddings(
File "..../lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 753, in _calc_text_embeddings
spatial_position_embeddings = self.embeddings._calc_spatial_position_embeddings(bbox)
File "..../lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 93, in _calc_spatial_position_embeddings
h_position_embeddings = self.h_position_embeddings(bbox[:, :, 3] - bbox[:, :, 1])
File "..../lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
return forward_call(*input, **kwargs)
File "..../lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
return F.embedding(
File "..../lib/python3.8/site-packages/torch/nn/functional.py", line 2203, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
After further inspection (vocab size, bboxes, dimensions, classes...) I noticed that there are negative values inside the input tensor causing the error, whereas the input tensors of the successful previous iterations contain non-negative integers only. These negative numbers are produced by _calc_spatial_position_embeddings(self, bbox)
in modeling_layoutlmv2.py
line 92:
h_position_embeddings = self.h_position_embeddings(bbox[:, :, 3] - bbox[:, :, 1])
- What may cause the returned input values to be negative?
- What could I do to prevent this error from happening?
Example of the input tensor that triggers the error in torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse):
tensor([[ 0, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 9, 9, 9, 9, 9, 10, 10, 12, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12,
12, 12, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 12, 12, 12, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 12, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 5, 5, 5, 5, 5, 5, -6, -6, -6, -6, -6, -6, 1, 1, 1, 1, 1,
5, 5, 5, 5, 5, 5, 7, 5, 7, 7, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0]])
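For context, the failure is easy to reproduce in isolation: torch.nn.Embedding raises this exact IndexError whenever it receives a negative index, which is what a box with y2 < y1 produces here. A minimal sketch (the embedding size below is made up for the demo, not the model's real table, which uses max_2d_position_embeddings=1024):

import torch
import torch.nn as nn

# Toy embedding table standing in for h_position_embeddings
# (16 entries is just for the demo).
h_position_embeddings = nn.Embedding(16, 4)

# A well-formed box (y2 >= y1) yields a valid height index.
good_bbox = torch.tensor([[[100, 50, 200, 60]]])        # x1, y1, x2, y2
good_height = good_bbox[:, :, 3] - good_bbox[:, :, 1]   # tensor([[10]])
print(h_position_embeddings(good_height).shape)         # torch.Size([1, 1, 4])

# A malformed box with y2 < y1 yields a negative "height" and raises
# IndexError: index out of range in self
bad_bbox = torch.tensor([[[100, 56, 200, 50]]])
bad_height = bad_bbox[:, :, 3] - bad_bbox[:, :, 1]      # tensor([[-6]])
h_position_embeddings(bad_height)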
2 Answers
After double checking the dataset, and specifically the coordinates of the labels, I found that some rows have bbox coordinates that lead to a zero or negative width or height (it is the negative differences that end up as invalid embedding indices). A simplified example is shown below. After removing these labels from the dataset, the issue was resolved.
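A minimal sketch of what such a degenerate box looks like (coordinates are invented for illustration, assumed normalized to the 0-1000 range LayoutLMv2 expects):

# x1, y1, x2, y2
ok_box  = [100, 50, 200, 60]    # width 100, height 10  -> fine
bad_box = [100, 56, 200, 50]    # y2 < y1, "height" = -6 -> negative embedding index

width  = bad_box[2] - bad_box[0]   # 100
height = bad_box[3] - bad_box[1]   # -6 -> IndexError: index out of range in self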
The more general problem is breaking any of the validity criteria, such as a box falling outside the image bounds. Below is code that removes any illegal boxes and their associated words before the bboxes and words are passed to the embeddings. It assumes you have two ordered lists, holding the normalized bounding boxes and the associated words respectively. It may not be exhaustive.
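A minimal sketch of such a filter, under the assumption that the boxes are already normalized to the 0-1000 range LayoutLMv2 expects; the names boxes, words and filter_illegal_boxes are placeholders, not from any library:

def filter_illegal_boxes(boxes, words, max_coord=1000):
    """Drop boxes (and their words) that would produce invalid spatial embeddings."""
    kept_boxes, kept_words = [], []
    for box, word in zip(boxes, words):
        x1, y1, x2, y2 = box
        in_bounds = all(0 <= c <= max_coord for c in box)
        positive_extent = x2 > x1 and y2 > y1   # rejects zero or negative width/height
        if in_bounds and positive_extent:
            kept_boxes.append(box)
            kept_words.append(word)
    return kept_boxes, kept_words

# Example: the inverted box is dropped along with its word.
boxes = [[100, 50, 200, 60], [100, 56, 200, 50]]
words = ["total", "amount"]
boxes, words = filter_illegal_boxes(boxes, words)
print(boxes, words)   # [[100, 50, 200, 60]] ['total']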
Using tools such as PaddleOCR makes these unconventional bounding boxes more likely, since such tools can return a wider variety of boxes than, e.g., PyTesseract when they detect vertical text.
Boxes must be in x1, y1, x2, y2 format, i.e. the top-left and bottom-right corners of the bounding box, where (0, 0) is the top-left corner of the image.
Note: one thing that caught me out before is that this means the y coordinates must be inverted, i.e. y is the distance from the top of the image.