Why does Azure Search merge OCRed text into the merged_content field in the wrong order?

Posted on 2025-02-06 10:27:00


I need to develop my own WebAPI custom skill that makes use of the Read API. I will use it in my custom skillset. I can't use the built-in OCR skill from Azure Cognitive Search.

The output of my WebAPI skill looks like this:

// logic to call the Read API and get the result for the current input record ("value")...
// now creating the output of the custom skill

    var textUrlFileResults = results.AnalyzeResult.ReadResults;
    foreach (ReadResult page in textUrlFileResults)
    {
        var newValue = new
        {
            RecordId = value.RecordId,
            Data = new
            {
                // join all recognized lines of the page into a single string
                text = string.Join(" ", page.Lines?.Select(x => x.Text))
            }
        };

        output.Values.Add(newValue);
    }
} // end of the enclosing loop over the input records of the request

return new OkObjectResult(output);
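Serialized to JSON, that response follows the custom WebApi skill contract: one entry in values per input record, which with the /document/normalized_images/* context means one record per normalized image. A minimal sketch of the body for the two images of the sample document (the recordId values are illustrative):

{
  "values": [
    {
      "recordId": "0",
      "data": {
        "text": "SHELL 1900 1904 1909 1930 1948 SHELL SHELL Shell Shell 1955 1961 1971 1995 1999"
      }
    },
    {
      "recordId": "1",
      "data": {
        "text": "B+W BLACK+WHITE PHOTOGRAPHY"
      }
    }
  ]
}

This is what ends up at /document/normalized_images/*/text and what the MergeSkill below receives as itemsToInsert.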

And here is my skillset definition:

  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
      "name": "#1",
      "context": "/document",
      "insertPreTag": " ",
      "insertPostTag": " ",
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        },
        {
          "name": "itemsToInsert",
          "source": "/document/normalized_images/*/text"
        },
        {
          "name": "offsets",
          "source": "/document/normalized_images/*/contentOffset"
        }
      ],
      "outputs": [
        {
          "name": "mergedText",
          "targetName": "merged_content"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
      "name": "#2",
      "description": null,
      "context": "/document/normalized_images/*",
      // i cut some info
      "inputs": [
        {
          "name": "image",
          "source": "/document/normalized_images/*"
        }
      ],
      "outputs": [
        {
          "name": "text",
          "targetName": "text"
        }
      ]
    }
  ],
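For context, the /document/normalized_images/* collection (and the contentOffset values the MergeSkill reads as offsets) only exists when the indexer is configured to extract images from the source documents. A minimal sketch of that indexer setting, assuming a blob data source as in the MS docs sample:

  "parameters": {
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "imageAction": "generateNormalizedImages"
    }
  }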

I am trying to OCR a PDF document that looks like this:
[screenshot of the sample PDF]

And in the index, the document I get looks like this:

{
  "@odata.context": " cutted ",
  "value": [
    {
      "@search.score": 1,
      "content": "\nText before shell\n\nText after shell\n\nText after bw\n\n\n\n\n\n\n\nAnd here second page\n\n\n",
      "merged_content": "\nText before shell\n\nText after shell\n\nText after bw\n\n SHELL 1900 1904 1909 1930 1948 SHELL SHELL Shell Shell 1955 1961 1971 1995 1999 \n\n B+W BLACK+WHITE PHOTOGRAPHY \n\n\n\nAnd here second page\n\n\n",
      "text": [
        "SHELL 1900 1904 1909 1930 1948 SHELL SHELL Shell Shell 1955 1961 1971 1995 1999",
        "B+W BLACK+WHITE PHOTOGRAPHY"
      ],
      "layoutText": [],
      "textFromOcr": "[\"SHELL 1900 1904 1909 1930 1948 SHELL SHELL Shell Shell 1955 1961 1971 1995 1999\",\"B+W BLACK+WHITE PHOTOGRAPHY\"]"
    }
  ]
}

My question is: why is the OCRed text not merged at the correct position within the standard text when I use "/document/normalized_images/*/contentOffset" as the offsets input of the MergeSkill? To be honest, my skillset is copy-pasted from the MS docs and it is not working as expected. I don't really understand what the built-in OCR skill does that is special. I need to develop my own OCR skill; I can't use the OCR skill from Search out of the box, I have to write it myself.


Comments (1)

单身情人 2025-02-13 10:27:00

Unfortunately, that is the behavior of the skill by design. It takes the text first and leaves the image text at the bottom. This is not something that can be changed at this time with code inside the skill, due to an implementation limitation. Changes to the OCR skill documentation have been made to reflect this and will hopefully be published this week, to clarify and avoid confusion.
