How do I extract data from a PDF?

Posted on 2024-07-27 03:43:16

My company receives data from an external company as Excel files. We import this into SQL Server to run reports on the data. They are now changing to PDF format. Is there a way to reliably port the data from the PDF and insert it into our SQL Server 2008 database?

Would this require writing an app or is there an automated way of doing this?

Comments (6)

水晶透心 2024-08-03 03:43:16

As already mentioned - you will have to write an app to do this, but ideally you would be able to get the raw data from the external company rather than having to process the PDF.

However, if you do want to extract the data from the PDF, I've used iText and found it to be very powerful, reliable and, most importantly, free. It comes in Java and .NET flavours; iTextSharp is the .NET version. It allows you to programmatically manipulate PDF documents, and it exposes the contents of the PDF to the application that you write.

深陷 2024-08-03 03:43:16

It all depends on how they've included the data within the PDF. Generally speaking, there are two possible scenarios here:

  1. The data is just a text object within a PDF. You'll need to use a tool to extract the text from the PDF then insert it into your database.

  2. The data is contained within form fields in a PDF. You'll need to use a tool to extract data from the form fields and insert it into your database.

Hopefully scenario #2 applies to you because this is precisely what PDF forms are designed for. Scenario #1 is really just a hack that you'd only use if you didn't have any other options. Extracting plain text from a PDF isn't as easy or accurate as you might expect.
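
To illustrate why scenario #1 is fragile, here is a minimal sketch of text extraction with iTextSharp 5, followed by a naive row parser. It assumes each data row comes out of the PDF as a single line whose columns are separated by whitespace and contain no embedded spaces; that layout, the class name `PdfTextRows` and the column count are assumptions for illustration, not something the question states.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextRows
{
    // Yields the extracted text of every page, one string per page.
    public static IEnumerable<string> ReadPages(string pdfPath)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            // iTextSharp page numbers start at 1.
            for (int page = 1; page <= reader.NumberOfPages; page++)
                yield return PdfTextExtractor.GetTextFromPage(reader, page);
        }
        finally
        {
            reader.Close();
        }
    }

    // Splits page text into rows and keeps only lines with the expected
    // number of columns. Assumption: columns are whitespace-separated
    // and no value contains a space.
    public static List<string[]> ParseRows(string pageText, int expectedColumns)
    {
        var rows = new List<string[]>();
        string[] lines = pageText.Split(
            new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

        foreach (string line in lines)
        {
            // Passing a null separator array splits on any whitespace.
            string[] cells = line.Split(
                (char[])null, StringSplitOptions.RemoveEmptyEntries);
            if (cells.Length == expectedColumns)
                rows.Add(cells);
        }
        return rows;
    }
}
```

Lines that do not match the expected column count are silently skipped here. In a real import you would probably want to log them, because a value with an embedded space (a customer name, say) is enough to break this kind of parsing.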

If you're receiving a PDF form then all you need to do is match up the right fields in the PDF form with the corresponding fields in your database and then suck in the data. This process could be entirely automated if you wrote your own application.
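
The field-matching step could look roughly like the sketch below, which reads AcroForm fields with iTextSharp 5 and writes them with a parameterized `INSERT`. The field names, the column names, the table `dbo.VendorData` and the class name `PdfFormImport` are all hypothetical placeholders; substitute the names from your own form and schema.

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;
using iTextSharp.text.pdf;

public static class PdfFormImport
{
    // Hypothetical mapping: PDF form field name -> database column name.
    static readonly Dictionary<string, string> FieldToColumn =
        new Dictionary<string, string>
        {
            { "customer_name", "CustomerName" },
            { "order_total", "OrderTotal" }
        };

    // Builds "INSERT INTO t (A, B) VALUES (@p0, @p1)".
    // Table and column names come from the fixed mapping above, never from
    // the PDF, so only the values need to be passed as parameters.
    public static string BuildInsertSql(string table, IList<string> columns)
    {
        string cols = string.Join(", ", columns);
        string pars = string.Join(", ", columns.Select((c, i) => "@p" + i));
        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + pars + ")";
    }

    public static void Import(string pdfPath, string connectionString)
    {
        var columns = new List<string>();
        var values = new List<object>();

        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            AcroFields form = reader.AcroFields;
            foreach (KeyValuePair<string, string> map in FieldToColumn)
            {
                columns.Add(map.Value);
                // A field that is missing from the form is stored as NULL.
                string value = form.GetField(map.Key);
                values.Add((object)value ?? DBNull.Value);
            }
        }
        finally
        {
            reader.Close();
        }

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            BuildInsertSql("dbo.VendorData", columns), conn))
        {
            for (int i = 0; i < values.Count; i++)
                cmd.Parameters.AddWithValue("@p" + i, values[i]);
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```

All values arrive as strings, so numeric or date columns would need parsing and validation before the insert; that step is left out of this sketch.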

"Would this require writing an app or is there an automated way of doing this?"

Yes, both of these options would require writing an app or buying an app. If you write your own app then you'll need to find a third-party PDF library that supports retrieving data from form fields or extracting text from a PDF.

孤独陪着我 2024-08-03 03:43:16

Disclaimer: I am affiliated with the makers of ByteScout PDF Extractor SDK tool

Just wanted to share some additional real-life scenarios for text data extraction from PDF:

  1. Scanned images with no searchable text: these should be processed by an OCR engine (such as the free Tesseract from Google).
  2. XFA forms: a subset of PDF that is supported mostly by Adobe tools. The data can still be extracted as XML with low-level PDF processing tools such as iTextSharp.
  3. ZUGFeRD PDF files, which are PDF documents with a copy of the form data attached as an XML file (which can be extracted with tools like this).
  4. Text incorrectly encoded by some PDF generators (it can be recovered via an OCR engine, though with some acceptable error rate).
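
For scenarios 2 and 3, once the XML has been pulled out of the PDF, reading it is ordinary XML processing. The sketch below flattens the leaf elements of such a document into name/value pairs using `System.Xml.Linq`. It assumes the XML is already available as a string, and the class name `XmlFormData` is made up for illustration; real XFA or ZUGFeRD data uses its own schema and namespaces.

```csharp
using System;
using System.Collections.Generic;
using System.Xml.Linq;

public static class XmlFormData
{
    // Flattens leaf elements into name -> value pairs.
    // Assumption: element names are unique. If a name repeats (for example
    // in repeating rows), the last value wins in this sketch.
    public static Dictionary<string, string> ReadLeafValues(string xml)
    {
        var result = new Dictionary<string, string>();
        XDocument doc = XDocument.Parse(xml);

        foreach (XElement el in doc.Descendants())
        {
            // Only elements without child elements carry a value.
            if (!el.HasElements)
                result[el.Name.LocalName] = el.Value.Trim();
        }
        return result;
    }
}
```

The resulting dictionary can then be matched against database columns in the same way as form field values.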
瑾夏年华 2024-08-03 03:43:16

Using iTextSharp, do the following:

using System;
using System.Text;
using iTextSharp.text.pdf;

   // Event handler from an ASP.NET code-behind class.
   protected void BtnSubmit_Click(object sender, EventArgs e)
   {
       // Placeholder: replace with the path to the PDF form.
       string filePath = @"GetFilePath";
       StringBuilder sb = new StringBuilder();

       // Reading field values only needs a PdfReader. A PdfStamper (and the
       // temporary output file it creates) is only required when writing.
       PdfReader reader = new PdfReader(filePath);
       try
       {
           AcroFields form = reader.AcroFields;

           // Placeholder: replace with the field name used in the PDF form.
           string value = form.GetField("GetFieldIdFromPDF");
           if (value != null)
               sb.Append(value);
       }
       finally
       {
           // Release the file handle held by the reader.
           reader.Close();
       }

       // sb now holds the field value, ready to be inserted into the database.
   }
木槿暧夏七纪年 2024-08-03 03:43:16

I think you will have to write an application for this. This question talks about extracting data from PDF. After that, you can export the data to Excel format so that you can keep your existing import process.
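
As one possible way to keep the existing import, the extracted rows could be written to a CSV file, which Excel opens directly. This is a sketch that swaps in CSV for a native Excel workbook, and the class name `CsvExport` is hypothetical; whether your current SQL Server import accepts CSV is something to check.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

public static class CsvExport
{
    // Quotes a cell if it contains a comma, a quote, or a line break,
    // doubling any embedded quotes (RFC 4180 style).
    public static string Escape(string cell)
    {
        if (cell == null) return "";
        bool needsQuotes = cell.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes
            ? "\"" + cell.Replace("\"", "\"\"") + "\""
            : cell;
    }

    // Joins the escaped cells of one row into a single CSV line.
    public static string ToCsvLine(IEnumerable<string> cells)
    {
        return string.Join(",", cells.Select(Escape));
    }

    // Writes all rows to the given path, one CSV line per row.
    public static void Write(string path, IEnumerable<string[]> rows)
    {
        // UTF-8 with a byte order mark, which helps Excel detect the encoding.
        using (var writer = new StreamWriter(path, false, new UTF8Encoding(true)))
        {
            foreach (string[] row in rows)
                writer.WriteLine(ToCsvLine(row));
        }
    }
}
```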

世态炎凉 2024-08-03 03:43:16

Look for information on "Scraping" the data from the PDF. I believe Adobe has some tools that allow you to do this for simple text but I've not used them.

Honestly though, I would try to do anything you can to get this data in a raw format from your vendor.
