How to include the results of a Python for loop in the output with Databricks on Apache Spark

Posted on 2025-02-04 18:01:40

I am trying to include the results of a for loop in my output.

Just to give you some background:

The following code extracts text from PDFs. It then saves the results to a dataframe "mydf":

import pandas as pd
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# endpoint, key, container and container_url are assumed to be defined earlier in the notebook.
# field_list = ["result.content"]

document_analysis_client = DocumentAnalysisClient(
    endpoint=endpoint, credential=AzureKeyCredential(key)
)

for blob in container.list_blobs():
    blob_url = container_url + "/" + blob.name
    # Run the prebuilt read (OCR) model against the blob's URL
    poller = document_analysis_client.begin_analyze_document_from_url(
        "prebuilt-read", blob_url)
    result = poller.result()
    print("Scanning " + blob.name + "...")
    print("document contains", result.content)
    for page in result.pages:
        print("----Analyzing Read from page #{}----".format(page.page_number))

# Holds the extracted text of the last blob processed
mydf = result.content

I modified the code to print page numbers with the following:

for page in result.pages:
    print("----Analyzing Read from page #{}----".format(page.page_number))

The problem is that I'm not sure how to include the page numbers in the output. As I mentioned, the output contains only the text extracted from the pages; it does not include the page numbers printed by this code:

for page in result.pages:
    print("----Analyzing Read from page #{}----".format(page.page_number))

I believe this is something that I have simply overlooked.

Any thoughts?

維他命╮ 2025-02-11 18:01:40

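If the PDFs in your storage account contain invoice data with page numbers, you can use the Form Recognizer OCR model; otherwise, you can refer to the official documentation.

If you are connected to Blob storage, make sure to substitute your own blob URL.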

Sample code:
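A minimal sketch of one way to carry the page number into the output, and not necessarily the approach taken in the references below. It assumes each page in the analysis result exposes `page_number` and a `lines` list whose items have a `content` attribute, as the read model's results do in the `azure-ai-formrecognizer` SDK. The helper name `pages_to_rows`, the column names, and the `fake_result` stand-in are hypothetical and exist only so the example runs without an Azure account.

```python
from types import SimpleNamespace


def pages_to_rows(blob_name, result):
    """Return one dict per page: source blob, page number, and page text."""
    rows = []
    for page in result.pages:
        # Join the page's text lines so each row holds one page of text.
        text = "\n".join(line.content for line in page.lines)
        rows.append(
            {"blob": blob_name, "page_number": page.page_number, "text": text}
        )
    return rows


# Hypothetical stand-in for poller.result(); shaped like the attributes used above.
fake_result = SimpleNamespace(
    pages=[
        SimpleNamespace(
            page_number=1,
            lines=[
                SimpleNamespace(content="Invoice 001"),
                SimpleNamespace(content="Total: 10"),
            ],
        ),
        SimpleNamespace(
            page_number=2,
            lines=[SimpleNamespace(content="Terms")],
        ),
    ]
)

# Accumulate rows across documents instead of overwriting a single variable.
all_rows = []
all_rows.extend(pages_to_rows("sample.pdf", fake_result))

for row in all_rows:
    print(row["blob"], row["page_number"], repr(row["text"]))
```

In the loop from the question, the same idea would be `all_rows.extend(pages_to_rows(blob.name, poller.result()))` for each blob, followed by `mydf = pd.DataFrame(all_rows)` after the loop, which gives one row per page with its page number. On Databricks, `spark.createDataFrame(mydf)` should then turn that pandas DataFrame into a Spark DataFrame if one is needed.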

References:

https://pypi.org/project/azure-ai-formrecognizer/

https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/how-to-guides/try-sdk-rest-api?pivots=programming-language-python#analyze-layout
