Finetuning LayoutLM on a FUNSD-like dataset: IndexError: index out of range in self

Posted on 2025-02-10 17:53:11

I'm experimenting with Hugging Face Transformers, fine-tuning microsoft/layoutlmv2-base-uncased through AutoModelForTokenClassification on a custom dataset similar to FUNSD (pre-processed and normalized). After a few iterations of training I get this error:

 Traceback (most recent call last):
  File "layoutlmV2/train.py", line 137, in <module>
    trainer.train()
  File "..../lib/python3.8/site-packages/transformers/trainer.py", line 1409, in train
    return inner_training_loop(
  File "..../lib/python3.8/site-packages/transformers/trainer.py", line 1651, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "..../lib/python3.8/site-packages/transformers/trainer.py", line 2345, in training_step
    loss = self.compute_loss(model, inputs)
  File "..../lib/python3.8/site-packages/transformers/trainer.py", line 2377, in compute_loss
    outputs = model(**inputs)
  File "..../lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "..../lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 1228, in forward
    outputs = self.layoutlmv2(
  File "..../lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "..../lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 902, in forward
    text_layout_emb = self._calc_text_embeddings(
  File "..../lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 753, in _calc_text_embeddings
    spatial_position_embeddings = self.embeddings._calc_spatial_position_embeddings(bbox)
  File "..../lib/python3.8/site-packages/transformers/models/layoutlmv2/modeling_layoutlmv2.py", line 93, in _calc_spatial_position_embeddings
    h_position_embeddings = self.h_position_embeddings(bbox[:, :, 3] - bbox[:, :, 1])
  File "..../lib/python3.8/site-packages/torch/nn/modules/module.py", line 1131, in _call_impl
    return forward_call(*input, **kwargs)
  File "..../lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 158, in forward
    return F.embedding(
  File "..../lib/python3.8/site-packages/torch/nn/functional.py", line 2203, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

After further inspection (vocab size, bboxes, dimensions, classes...), I noticed that there are negative values inside the input tensor causing the error, while the input tensors of the successful previous iterations contain unsigned integers only. These negative numbers are produced by _calc_spatial_position_embeddings(self, bbox) in modeling_layoutlmv2.py, line 92:

h_position_embeddings = self.h_position_embeddings(bbox[:, :, 3] - bbox[:, :, 1])
  • What may cause the returned input values to be negative?
  • What could I do to prevent this error from happening?

Example of an input tensor that triggers the error in torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse):

tensor([[ 0, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
         11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
         11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
         11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 11, 11, 11, 11,
         11, 11, 11, 11, 11, 11, 11, 11, 11,  9,  9,  9,  9,  9,  9,  9,  9,  9,
          9,  9,  9,  9,  9,  9,  9, 10, 10, 12, 12, 12, 12, 12, 12, 12, 12, 12,
         12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
         12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
         12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
         12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
         12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 11, 11, 11, 11,
         11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12,
         12, 12, 12, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12,
         12, 12, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
         11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
         11, 11, 11, 11, 11, 11, 11, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
         10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 12, 12, 12, 12,
         12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
         12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
         12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
         12, 12, 12, 12, 12, 12, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
         11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
         11, 11, 12, 12, 12, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
         11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12,
         12, 12, 12, 12, 12, 12, 12, 12,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,
          8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  8,
          8,  5,  5,  5,  5,  5,  5, -6, -6, -6, -6, -6, -6,  1,  1,  1,  1,  1,
          5,  5,  5,  5,  5,  5,  7,  5,  7,  7,  0,  0,  0,  0,  0,  0,  0,  0,
          0,  0,  0,  0,  0,  0,  0,  0]])
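
For reference, the failure mode can be reproduced in isolation: nn.Embedding raises exactly this IndexError for any negative index, which is what a box with y2 < y1 produces. A minimal sketch (the table size 1024 matches LayoutLMv2's default max_2d_position_embeddings; the embedding dimension is illustrative):

    import torch
    import torch.nn as nn

    # LayoutLMv2's spatial embedding tables have max_2d_position_embeddings
    # rows (1024 by default); the embedding dimension here is illustrative.
    h_position_embeddings = nn.Embedding(1024, 128)

    # A degenerate box where y2 < y1 yields a negative "height"...
    bbox = torch.tensor([[[10, 20, 30, 14]]])  # (x1, y1, x2, y2)
    height = bbox[:, :, 3] - bbox[:, :, 1]     # tensor([[-6]])

    # ...and a negative index into nn.Embedding raises
    # "IndexError: index out of range in self".
    h_position_embeddings(height)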

2 Answers

£噩梦荏苒 2025-02-17 17:53:11

After double-checking the dataset, and specifically the coordinates of the labels, I found that some rows' bbox coordinates lead to zero width or height. Here's a simplified example:

x1, y1, x2, y2 = dataset_row["bbox"]
print((x2-x1 < 1) or (y2-y1 < 1)) #output is sometimes True

After removing these labels from the dataset, the issue was resolved.
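
For illustration, the cleanup can be written as a simple filter over the dataset. A sketch, where `dataset_rows` is a hypothetical stand-in for however the rows are actually stored:

    # Keep only rows whose boxes have strictly positive width and height.
    # `dataset_rows` is a hypothetical list of dicts with a "bbox" key.
    def has_valid_box(row):
        x1, y1, x2, y2 = row["bbox"]
        return (x2 - x1 >= 1) and (y2 - y1 >= 1)

    dataset_rows = [row for row in dataset_rows if has_valid_box(row)]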

新雨望断虹 2025-02-17 17:53:11

The more general problem is breaking any of the bounding-box constraints, such as coordinates that fall outside the image. Here is code that removes any illegal boxes and their associated words before the bboxes and words are passed to the embeddings. It assumes you have two ordered lists, holding the normalized bounding boxes and the associated words respectively. It may not be exhaustive.

Tools such as PaddleOCR are more likely to produce these unconventional bounding boxes, since they can return a wider variety of boxes than, say, PyTesseract, e.g. when finding vertical text.

Boxes must be in x1, y1, x2, y2 format, i.e. the top-left and bottom-right corners of the bounding box, where (0, 0) is the top-left corner of the image.

Note: one thing that caught me out before is that this means y coordinates must be inverted, i.e. y is the distance from the top of the image.

    # Iterate in reverse so that deleting an element does not shift the
    # indices of boxes that have not been checked yet (deleting while
    # iterating forward over enumerate() would silently skip elements).
    for enum in reversed(range(len(boxes_norm))):
        box = boxes_norm[enum]
        if (
            box[0] >= box[2]  # left coordinate actually on the right
            or box[1] >= box[3]  # bottom coordinate actually on top
            or box[0] < 0  # off the page
            or box[1] < 0  # off the page
            or box[2] < 0  # off the page
            or box[3] < 0  # off the page
            or box[0] > 1000  # off the page
            or box[1] > 1000  # off the page
            or box[2] > 1000  # off the page
            or box[3] > 1000  # off the page
        ):
            # print("removing invalid box and associated word from image - ",
            #       example["image_path"])
            # print("box - ", box)
            del boxes_norm[enum]
            del words[enum]
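
As a complement, if your boxes start out in pixel coordinates, the usual way to get them into this 0-1000 space follows the normalization formula used in the LayoutLM fine-tuning examples. A sketch, where `width` and `height` are assumed to be the page dimensions in pixels:

    def normalize_box(box, width, height):
        # Scale pixel coordinates into the 0-1000 range that LayoutLMv2
        # expects, with (0, 0) at the top-left corner of the image.
        x1, y1, x2, y2 = box
        return [
            int(1000 * x1 / width),
            int(1000 * y1 / height),
            int(1000 * x2 / width),
            int(1000 * y2 / height),
        ]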