How do I extract data from a PDF?
My company receives data from an external company via Excel. We export this into SQL Server to run reports on the data. They are now changing to PDF format. Is there a way to reliably port the data from the PDF and insert it into our SQL Server 2008 database?
Would this require writing an app, or is there an automated way of doing this?
Comments (6)
As already mentioned, you will have to write an app to do this, but ideally you would be able to get the raw data from the external company rather than having to process the PDF.
However, if you do want to extract the data from the PDF, I've used iText and found it to be very powerful, reliable and, most importantly, free. It comes in Java and .NET flavours; iTextSharp is the .NET version. It allows you to programmatically manipulate PDF documents, and it will expose the contents of the PDF to the application that you write.
It all depends on how they've included the data within the PDF. Generally speaking, there are two possible scenarios here:
1. The data is just a text object within a PDF. You'll need to use a tool to extract the text from the PDF, then insert it into your database.
2. The data is contained within form fields in a PDF. You'll need to use a tool to extract data from the form fields and insert it into your database.
Hopefully scenario #2 applies to you, because this is precisely what PDF forms are designed for. Scenario #1 is really just a hack that you'd only use if you didn't have any other options. Extracting plain text from a PDF isn't as easy or accurate as you might expect.
If you're receiving a PDF form, then all you need to do is match up the right fields in the PDF form with the corresponding fields in your database and then pull in the data. This process could be entirely automated if you wrote your own application.
Yes, both of these options would require writing an app or buying an app. If you write your own app, then you'll need to find a third-party PDF library that supports retrieving data from form fields or extracting text from a PDF.
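To make the field-matching step in scenario #2 concrete, here is a minimal sketch in Python. It assumes a PDF library of your choice has already returned the form fields as a plain dict of field name to value; the field names, column names, and the `insert_form_data` helper are all hypothetical, and the standard-library `sqlite3` module stands in for SQL Server so the example runs on its own.

```python
import sqlite3

# Hypothetical mapping from PDF form field names to database column names.
# These names are assumptions for illustration; use the ones in your own form.
FIELD_TO_COLUMN = {
    "CustomerName": "customer_name",
    "InvoiceNo": "invoice_no",
    "Amount": "amount",
}


def insert_form_data(conn, table, fields):
    """Insert one row built from PDF form field values.

    `fields` is a dict of field name -> value, as a PDF library might return it.
    `table` and the column names come from trusted code, never from the PDF.
    """
    missing = [name for name in FIELD_TO_COLUMN if name not in fields]
    if missing:
        # Fail loudly rather than insert a partial row.
        raise ValueError(f"PDF form is missing fields: {missing}")

    columns = list(FIELD_TO_COLUMN.values())
    values = [fields[name] for name in FIELD_TO_COLUMN]
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"

    # Values are passed as parameters, so field contents cannot alter the SQL.
    conn.execute(sql, values)
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE invoices (customer_name TEXT, invoice_no TEXT, amount TEXT)"
    )
    insert_form_data(
        conn,
        "invoices",
        {"CustomerName": "Acme", "InvoiceNo": "INV-1", "Amount": "12.50"},
    )
    print(conn.execute("SELECT * FROM invoices").fetchall())
```

Against a real SQL Server database the same pattern should carry over with a different driver and connection, though the parameter placeholder syntax depends on the driver you pick.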
Disclaimer: I am affiliated with the makers of the ByteScout PDF Extractor SDK tool.
Just wanted to share some additional real-life scenarios for text data extraction from PDF:
Using iTextSharp, do the following
I think you will have to write an application for this. This question talks about extracting data from PDF. After this, you can export the data to Excel format so that you can preserve the existing import format.
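As a rough illustration of that last step, the sketch below takes text that some PDF tool has already extracted and turns it into CSV, which Excel opens directly. It is only a sketch under stated assumptions: the column layout (cells separated by two or more spaces), the sample column names, and the helper names `text_to_rows` and `rows_to_csv` are all hypothetical, and CSV is used here in place of a true Excel workbook.

```python
import csv
import io
import re

# Assumed layout: columns in the extracted text are separated by two or more
# spaces. Real PDFs vary widely, so this pattern will likely need adjusting.
COLUMN_GAP = re.compile(r"\s{2,}")


def text_to_rows(text, expected_columns):
    """Split extracted text into rows, keeping lines with the expected cell count."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        cells = COLUMN_GAP.split(line)
        # Titles, page footers and other stray lines are skipped here;
        # a real import should probably log them for review instead.
        if len(cells) == expected_columns:
            rows.append(cells)
    return rows


def rows_to_csv(rows, header):
    """Serialize a header plus rows to CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = "Report for March\n\nAcme Ltd   INV-1   12.50\nPage 1\n"
    print(rows_to_csv(text_to_rows(sample, 3), ["customer", "invoice", "amount"]))
```

Keeping the output in the same shape as the old Excel files means the existing SQL Server import should not need to change; only the extraction step in front of it is new.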
Look for information on "scraping" the data from the PDF. I believe Adobe has some tools that allow you to do this for simple text, but I've not used them.
Honestly though, I would do anything you can to get this data in a raw format from your vendor.