Azure Form Recognizer does not find all content with Python on Databricks

Posted on 2025-01-31 01:35:28

I am running the following Python on Databricks with the relevant Cognitive Services Form Recognizer libraries installed:

from azure.ai.formrecognizer import FormRecognizerClient
from azure.core.credentials import AzureKeyCredential

credential = AzureKeyCredential("aaa6123af5b843a38044538d95584c3d")
endpoint = "https://myformrecognizr.cognitiveservices.azure.com/"

form_recognizer_client = FormRecognizerClient(endpoint, credential)

with open("/dbfs/mnt/lake/RAW/export/Picturehouse.pdf", "rb") as fd:
    form = fd.read()

poller = form_recognizer_client.begin_recognize_content(form)
form_pages = poller.result()

for content in form_pages:
    for table in content.tables:
        print("Table found on page {}:".format(table.page_number))
        print("Table location {}:".format(table.bounding_box))
        for cell in table.cells:
            print("Cell text: {}".format(cell.text))
            print("Location: {}".format(cell.bounding_box))
            print("Confidence score: {}\n".format(cell.confidence))

    if content.selection_marks:
        print("Selection marks found on page {}:".format(content.page_number))
        for selection_mark in content.selection_marks:
            print("Selection mark is '{}' within bounding box '{}' and has a confidence of {}".format(
                selection_mark.state,
                selection_mark.bounding_box,
                selection_mark.confidence
            ))

The PDF form looks like the following:

[image not included]

The library recognizes:
Cell text: Item
Cell text: Qty
Cell text: Seat Allocation
Cell text: Subtotal
Cell text: Adult
Cell text: 1
Cell text: D-11
Cell text: 14.50

But it doesn't recognize the following text from the PDF:

You can go straight to the screen by showing your e-ticket to an
usher. Alternatively, you can collect your tickets at Box Office at
least 15 minutes before the advertised start time of the film or
event. You need your Booking Reference and/or payment card to help us
find your booking. You can print this page by clicking the "Print This
Page" link above.

Is that by design? Or am I missing something in my code?

Comments (1)

吾性傲以野 2025-02-07 01:35:28

Unfortunately, this is by design. Form Recognizer works with pre-trained models that recognize key-value pairs, text, and tables in the file you upload as input. Text is recognized even when the file mixes a large amount of paragraph text with table content in the middle or anywhere else.
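One thing that may be worth checking: the loop in the question only prints `content.tables` and `content.selection_marks`. In the 3.x `azure-ai-formrecognizer` SDK, each page returned by `begin_recognize_content` also has a `lines` list, and paragraph text would be expected there rather than inside a table. The following is a minimal sketch under that assumption. `page_text_lines` and `sample_pages` are hypothetical names introduced here, and the stand-in objects only mimic the `page_number`, `lines`, and `text` attributes so the sketch runs without calling the service.

```python
from types import SimpleNamespace


def page_text_lines(form_pages):
    """Return (page_number, text) pairs for every recognized line of text.

    Assumes each page exposes `page_number` and `lines`, and each line
    exposes `text`, like the pages in the question's `form_pages`.
    """
    results = []
    for page in form_pages:
        # Treat a missing or empty `lines` attribute as "no text on this page".
        for line in page.lines or []:
            results.append((page.page_number, line.text))
    return results


# Stand-in data (hypothetical) instead of a real service response.
sample_pages = [
    SimpleNamespace(
        page_number=1,
        lines=[
            SimpleNamespace(text="You can go straight to the screen by showing your e-ticket to an"),
            SimpleNamespace(text="usher."),
        ],
    )
]

for page_number, text in page_text_lines(sample_pages):
    print("Page {}: {}".format(page_number, text))
```

With the question's code, the same function could be called as `page_text_lines(form_pages)` right after `form_pages = poller.result()`.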

For more details, see these links:

https://www.drware.com/extract-data-from-pdfs-using-form-recognizer-with-code-or-without/

https://www.youtube.com/watch?v=iBQO4QdUp6A&t=10s

https://github.com/tomweinandy/form_recognizer_demo
