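Azure Form Recognizer does not find all content with Python on Databricks
I am running the following Python on Databricks using the Cognitive Services Form Recognizer library: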
from azure.ai.formrecognizer import FormRecognizerClient
from azure.core.credentials import AzureKeyCredential

credential = AzureKeyCredential("aaa6123af5b843a38044538d95584c3d")
endpoint = "https://myformrecognizr.cognitiveservices.azure.com/"
form_recognizer_client = FormRecognizerClient(endpoint, credential)

# Read the PDF from the mounted data lake as raw bytes
with open("/dbfs/mnt/lake/RAW/export/Picturehouse.pdf", "rb") as fd:
    form = fd.read()

poller = form_recognizer_client.begin_recognize_content(form)
form_pages = poller.result()

for content in form_pages:
    # Print every table cell found on the page
    for table in content.tables:
        print("Table found on page {}:".format(table.page_number))
        print("Table location {}:".format(table.bounding_box))
        for cell in table.cells:
            print("Cell text: {}".format(cell.text))
            print("Location: {}".format(cell.bounding_box))
            print("Confidence score: {}\n".format(cell.confidence))

    # Print any selection marks (for example, checkboxes) on the page
    if content.selection_marks:
        print("Selection marks found on page {}:".format(content.page_number))
        for selection_mark in content.selection_marks:
            print("Selection mark is '{}' within bounding box '{}' and has a confidence of {}".format(
                selection_mark.state,
                selection_mark.bounding_box,
                selection_mark.confidence
            ))
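The PDF contains a table, and the library recognizes its cells:
Cell text: Item, Cell text: Quantity, Cell text: Seat allocation, Cell text: Subtotal, Cell text: Adult, Cell text: 1, Cell text: D-11, Cell text: 14.50
But it does not recognize the following text in the PDF:
You can proceed directly to the screen by showing your e-ticket to the usher. Alternatively, you can collect your tickets at the box office by the start time of the film or event. You will need your booking reference and/or payment card to help us locate your booking. You can print this page by clicking the "Print this page" link above.
Is this by design? Or am I missing something in my code?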
Comments (1)
Unfortunately, this is by design. Form Recognizer relies on pre-trained models that recognize key-value pairs, text, and tables in the documents you upload as input. Even when a file contains a large amount of paragraph text with table content in the middle or anywhere else, it will be recognized.
For more details, please refer to these links:
https://www.drware.com/extract-data-from-pdfs-using-form-recognizer-with-code-or-without/
https://www.youtube.com/watch?v=iBQO4QdUp6A&t=10s
https://github.com/tomweinandy/form_recognizer_demo
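One thing worth checking, offered as a hedged observation rather than a definitive diagnosis: the code in the question only iterates `tables` and `selection_marks`. In the 3.x releases of `azure-ai-formrecognizer`, each page returned by `begin_recognize_content` also exposes a `lines` collection whose items carry a `text` attribute, so paragraph text outside the table may already be in the result without being printed. The sketch below assumes that SDK version; `extract_lines` is a hypothetical helper name, and the sample pages are stand-in objects so the snippet runs without an Azure resource.

```python
from types import SimpleNamespace


def extract_lines(form_pages):
    """Return (page_number, text) pairs for every text line on every page.

    Assumes each page exposes `page_number` and `lines`, and each line
    exposes `text`, as FormPage and FormLine do in azure-ai-formrecognizer 3.x.
    """
    results = []
    for page in form_pages:
        # `lines` may be None or empty when no text was found on the page.
        for line in page.lines or []:
            results.append((page.page_number, line.text))
    return results


# Stand-in data for illustration only; with the real service, pass
# `poller.result()` from begin_recognize_content instead.
sample_pages = [
    SimpleNamespace(
        page_number=1,
        lines=[
            SimpleNamespace(text="Item Quantity Seat allocation Subtotal"),
            SimpleNamespace(text="You will need your booking reference"),
        ],
    )
]

for page_number, text in extract_lines(sample_pages):
    print("Page {}: {}".format(page_number, text))
```

In the question's code this would amount to calling `extract_lines(form_pages)` after `form_pages = poller.result()`. If the paragraph text still does not appear there, the limitation described in the answer above is the more likely explanation.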