How to include the results of a Python for loop in the output with Databricks on Apache Spark
I am trying to include the results of a for loop in my code's output.
Just to give you some background:
The following code extracts text from PDFs and then saves the result to a dataframe "mydf":
import pandas as pd
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

# field_list = ["result.content"]
document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

for blob in container.list_blobs():
    blob_url = container_url + "/" + blob.name
    poller = document_analysis_client.begin_analyze_document_from_url(
        "prebuilt-read", blob_url)
    result = poller.result()
    print("Scanning " + blob.name + "...")
    print("document contains", result.content)
    for page in result.pages:
        print("----Analyzing Read from page #{}----".format(page.page_number))
    mydf = result.content
I modified the code to print page numbers with the following loop:
    for page in result.pages:
        print("----Analyzing Read from page #{}----".format(page.page_number))
The problem is that I'm not sure how to include the page numbers in the output. As I mentioned, the output contains only the text extracted from the pages; it does not include the page numbers produced by this code:
    for page in result.pages:
        print("----Analyzing Read from page #{}----".format(page.page_number))
I believe this is something that I have simply overlooked.
Any thoughts?
Comments (1)
If the PDFs in your storage account are invoices with page numbers, you can manually label the page numbers using OCR in Form Recognizer; otherwise, you can follow the official documentation.
If you are connected to blob storage, make sure to replace formUrl with the blob URL.
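As a rough illustration of that substitution, the blob URL can be assembled from the container URL and the blob name. The helper name `build_blob_url` is hypothetical, and percent-encoding the name with `urllib.parse.quote` is an assumption added here to cope with blob names that contain spaces. A private container would also need a SAS token appended, which this sketch does not handle.

```python
from urllib.parse import quote


def build_blob_url(container_url: str, blob_name: str) -> str:
    """Join a container URL and a blob name into a single URL.

    Hypothetical helper, not part of the Azure SDK.
    """
    # Keep "/" unescaped so virtual folders in the blob name survive.
    return container_url.rstrip("/") + "/" + quote(blob_name, safe="/")


# Example with a made-up storage account and container name.
url = build_blob_url(
    "https://example.blob.core.windows.net/invoices", "2023/scan 01.pdf"
)
print(url)
```

The resulting string is what would be passed as the second argument to `begin_analyze_document_from_url` in the question's loop.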
Please follow the references below, which explain Azure Form Recognizer in detail. Sample code:
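One way to get the page numbers into the output, rather than only printing them, is to collect one row per page while looping over `result.pages`. This is an illustrative sketch, not an official sample. It assumes each page exposes `page_number` and a `lines` list whose items have a `content` attribute, as `DocumentPage` does in `azure-ai-formrecognizer` 3.2 and later. The function name `pages_to_rows` and the column names are hypothetical, and the `SimpleNamespace` objects only stand in for a real analysis result so the example runs without an Azure endpoint.

```python
from types import SimpleNamespace


def pages_to_rows(blob_name, result):
    """Return one dict per page: blob name, page number, page text."""
    rows = []
    for page in result.pages:
        # Join the recognized lines of this page into a single string.
        text = "\n".join(line.content for line in page.lines)
        rows.append(
            {"blob": blob_name, "page_number": page.page_number, "text": text}
        )
    return rows


# Stand-in for poller.result(); assumed to mirror the real attributes.
fake_result = SimpleNamespace(
    pages=[
        SimpleNamespace(
            page_number=1,
            lines=[
                SimpleNamespace(content="Invoice 001"),
                SimpleNamespace(content="Total: 10"),
            ],
        ),
        SimpleNamespace(
            page_number=2,
            lines=[SimpleNamespace(content="Terms")],
        ),
    ]
)

rows = pages_to_rows("invoice.pdf", fake_result)
for row in rows:
    print(row["page_number"], repr(row["text"]))
```

In the loop from the question, calling `rows.extend(pages_to_rows(blob.name, result))` for each blob would accumulate rows across all PDFs. `pd.DataFrame(rows)` would then give a table with a `page_number` column, and on Databricks `spark.createDataFrame(pd.DataFrame(rows))` should turn it into a Spark DataFrame.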
Reference:
https://pypi.org/project/azure-ai-formrecognizer/
https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/how-to-guides/try-sdk-rest-api?pivots=programming-language-python#analyze-layout
https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/concept-layout