Extracting data from PDFs at scale with Form Recognizer: HttpResponseError: (FailedToDownloadImage) Failed to download image from input URL on Databricks

Posted 2025-01-31 03:52:25 · 1453 words · 4 views · 0 comments

I am trying to extract data from PDFs at scale with Azure Form Recognizer. I am using the code example on GitHub.

I have entered the code as follows:

import pandas as pd

field_list = ["InvoiceId", "VendorName", "VendorAddress", "CustomerName", "CustomerAddress", "CustomerAddressRecipient", "InvoiceDate", "InvoiceTotal", "DueDate"]
df = pd.DataFrame(columns=field_list)

for blob in container.list_blobs():
  blob_url = container_url + "/" + blob.name
  poller = form_recognizer_client.begin_recognize_invoices_from_url(invoice_url=blob_url)
  invoices = poller.result()
  print("Scanning " + blob.name + "...")
  
  for idx, invoice in enumerate(invoices):
      single_df = pd.DataFrame(columns=field_list)

      for field in field_list:
        entry = invoice.fields.get(field)
        
        if entry:
          single_df[field] = [entry.value]
          
      single_df['FileName'] = blob.name
      df = df.append(single_df)

df = df.reset_index(drop=True)
df

However, I keep on getting the following error:

HttpResponseError: (FailedToDownloadImage) Failed to download image from input URL.

My URL looks like the following:

https://blobpretbiukblbdev.blob.core.windows.net/demo?sp=racwdl&st=2022-05-21T19:39:07Z&se=2022-05-22T03:39:07Z&sv=2020-08-04&sr=c&sig=XYhdecG2jKF8aNPPpkcP%2FCGVVRKYTFPrOQYdNDsASCA%3D/pdf1.pdf

NB: The key has been regenerated; I have left the old one in only to illustrate how the URL appears in my code.

Where might I be going wrong?

Comments (1)

花落人断肠 2025-02-07 03:52:25

As the REST API documentation notes, you need to specify the Content-Type. You also need to make the source publicly accessible via a JSON file. Set the Content-Type to application/pdf. To make this work, install the filetype package:

pip install filetype

Check this link for a better way to use the Form Recognizer SDK with the REST API.
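
Separately from the Content-Type point, the URL shown in the question ends in `...sig=...%3D/pdf1.pdf`, which suggests the blob name was appended after the SAS query string instead of being placed in the path before the `?`. A minimal sketch of one way to build the per-blob URL, assuming `container_url` is a container-level SAS URL as in the question. The helper name `blob_url_with_sas` and the account name in the example are hypothetical, not part of any Azure SDK:

```python
from urllib.parse import quote, urlsplit, urlunsplit


def blob_url_with_sas(container_sas_url: str, blob_name: str) -> str:
    """Insert blob_name into the path of a container SAS URL.

    Hypothetical helper: keeps the SAS query string untouched and
    appends the blob name to the path, before the '?'.
    """
    parts = urlsplit(container_sas_url)
    # quote() leaves '/' unescaped, so virtual folders in the blob name survive.
    path = parts.path.rstrip("/") + "/" + quote(blob_name)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


# Example with a made-up account and a shortened SAS token.
container_url = "https://example.blob.core.windows.net/demo?sp=r&sr=c&sig=abc%3D"
print(blob_url_with_sas(container_url, "pdf1.pdf"))
# → https://example.blob.core.windows.net/demo/pdf1.pdf?sp=r&sr=c&sig=abc%3D
```

In the question's loop this would take the place of `container_url + "/" + blob.name`. Whether this is the only cause of the error cannot be confirmed from the post; a SAS token that has passed its `se=` expiry time could also prevent the service from downloading the file.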
