Programmatically extract PDF tables
I have a bunch of PDF docs with tabular data in them which I need to extract into a more readable format to store in a spreadsheet, database or whatever.
Is there anything out there (preferably free) that can get tabular data out of PDFs into a more readable format in bulk, either natively integrated with an app, passively via the command line, or by looping the process in code (.NET)?
It can be any format really (doc, html), just as long as the tables are maintained.
Anything I've found so far either is a one-off (it only does one doc at a time, and I have hundreds, so that isn't happening) or does not maintain the table structure.
If you have any ideas, please post.
Answers (9)
This is a giant hassle. In general, extracting the text content of a PDF file is running against the grain of what PDF wants you to do.
Start by trying to get the text out. This may be more or less successful, depending on how the PDF is built. One place to start is Ghostscript or pstotext. If that fails you, this guy has a list of text extraction tools. Once you have the text stream, you can try to reassemble the tabular structure programmatically.
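As a rough sketch of that first step done in bulk: the Python below runs Ghostscript once per PDF in a folder and then splits each extracted line into cells. It assumes a Ghostscript build that includes the `txtwrite` output device and is on the PATH as `gs` (on Windows the console executable is typically `gswin64c`). It also assumes the extracted text keeps columns separated by runs of two or more spaces, which depends entirely on how the PDF was built. The function names are made up for illustration.

```python
import re
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, txt_path, executable="gs"):
    """Build a Ghostscript command that writes the text of one PDF to a file."""
    return [
        executable,
        "-q", "-dNOPAUSE", "-dBATCH",   # no banner, no prompts, exit when done
        "-sDEVICE=txtwrite",            # text extraction output device
        f"-sOutputFile={txt_path}",
        str(pdf_path),
    ]


def extract_folder(folder, executable="gs"):
    """Run Ghostscript once per PDF in `folder`; return the text files written."""
    written = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        subprocess.run(ghostscript_command(pdf_path, txt_path, executable), check=True)
        written.append(txt_path)
    return written


def split_columns(line):
    """Split one line of text into cells wherever two or more spaces occur.

    Assumption: single spaces belong to a cell, wider gaps separate columns.
    """
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]
```

Calling `extract_folder("reports")` on a hypothetical `reports` directory would leave one `.txt` file next to each PDF. Feeding each line of those files through `split_columns` gives a first, crude guess at the rows.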
Finally, if you are in seriously bad shape and the PDFs don't cooperate, you could do the OCR thing. The right long-term solution is to get the data into the right format at the outset, either by doing a single, massive, painful, and probably partially manual conversion, or by going to the source and suggesting that the data be provided in a more usable form.
If you can give a more specific PDF example file, there may be a better or more precise answer. There is NO general solution to this; if it is possible at all, it will need to be tailored to your specific source data.
Note this rather pointed response to the general question. It doesn't help with the fact that you have the problem in front of you, but maybe it would provide useful top cover when explaining to your boss why there isn't an obvious answer? ;-)
A new SO question popped up, and referred to this library -- iTextSharp -- which looks possibly related. SO question: Best way to extract...
1. For background on why the PDF file format should never, ever be thought of as suitable for hosting extractable, structured data, see this article:
2. For an amazing family of tools that gets better from week to week at extracting tabular data from PDFs (unless they are scanned pages), contradicting point 1 above, see these links:
Check out IvyTools IvyPdf: www.ivytools.net
It can extract tables as well as any other data. If your documents are well structured it's very easy to set up, but it can deal with pretty complicated scenarios too. It's free for personal use.
Given your requirements, the straightforward answer to your question is that this is not really possible. The reason is that, unlike Word or Excel, the PDF specification has no object called a table. The tables you see in those PDF documents are just a series of rectangles drawn so that they look like a table. How they are drawn is up to the PDF writer that created the files; some writers draw table-like structures using a series of lines instead.
You could possibly write your own parser based on the PDF file specification, but that is a daunting task, and it will take several months to get one that works with a reasonable range of PDF documents.
In case you do decide to write your own parser, the article below should give you a jump start.
Code Project Article
The PDF format is built as a collection of letters, which have no inherent format or structure. You can think of a PDF as a page that has come through OCR, and you are taking it from there: the letters and their coordinates are there, and the rest is up to you. You have to figure out the layout, formats, columns, and eventually the tables.
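To make "letters and their coordinates" concrete, here is a minimal Python sketch of the first layer of that work: grouping positioned characters into lines, then into words. The `Glyph` record and both tolerances are assumptions for illustration. A real PDF library will hand you characters in its own structure, and the thresholds need tuning per document.

```python
from collections import namedtuple

# Hypothetical record: one character, the x of its left edge,
# the y of its baseline, and its advance width (all in PDF points).
Glyph = namedtuple("Glyph", "char x y width")


def group_into_lines(glyphs, y_tolerance=2.0):
    """Group glyphs with nearly equal baselines into lines, top to bottom."""
    lines = []
    # PDF y coordinates grow upward, so the largest y is the top of the page.
    for glyph in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - glyph.y) <= y_tolerance:
            lines[-1].append(glyph)
        else:
            lines.append([glyph])
    # Within each line, read left to right.
    return [sorted(line, key=lambda g: g.x) for line in lines]


def line_to_words(line, gap=1.5):
    """Join glyphs into words; a horizontal gap wider than `gap` starts a new word."""
    words, current, previous = [], "", None
    for glyph in line:
        if previous is not None and glyph.x - (previous.x + previous.width) > gap:
            words.append(current)
            current = ""
        current += glyph.char
        previous = glyph
    if current:
        words.append(current)
    return words
```

Columns and tables are then a matter of looking for word positions that repeat from line to line.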
When you say that what you've found so far is "a one-off", I'll assume you mean it "is a GUI app, without a programming interface."
In this case you could use Microsoft UI Automation to programmatically control the app and make it do what you want.
If all the data is text data, you can always use iTextSharp. It's free and you only need the "itextsharp.dll".
http://sourceforge.net/projects/itextsharp/
Here is a simple function for reading the text out of a PDF.
That will at least get you the text to start with.
I've tried extracting the plain text from PDFs using tools like pdf2text, but too much of the table, formatting, and layout information is lost to accurately reconstruct the original version.
It might be more successful to use a PDF API to extract the x,y positions of the text boxes and lines, and use that information to reconstruct the table.
There seem to be several third party tools and APIs that try this approach:
The paid version of Solid Framework seems to be able to extract tables from PDF to Excel and CSV automatically, and it did fairly well on the PDFs I've thrown at it.
The free PDF Mechanic seems to be a small GUI program wrapped around Solid Framework, which you can use to try out their PDF extraction technique.
There's also the free tool pdf2table which you might be able to call from your program, but I haven't tried it yet.
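The coordinate-based approach described above could be sketched roughly as follows in Python. It assumes you already have a list of `(text, x, y)` tuples for the text boxes on one page, from whatever PDF API you use; that input shape is an assumption, not any particular library's output. It also assumes cells in a column share roughly the same left edge, so right-aligned or centred columns would need a different key. Ruling lines are ignored here.

```python
import csv
import io


def cluster(values, tolerance):
    """Collapse nearby numbers into clusters; each cluster is represented by its
    smallest member, and a value joins the cluster if it lies within `tolerance`
    of that representative."""
    centers = []
    for value in sorted(values):
        if not centers or value - centers[-1] > tolerance:
            centers.append(value)
    return centers


def nearest_index(centers, value):
    """Index of the cluster representative closest to `value`."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - value))


def boxes_to_table(boxes, x_tolerance=5.0, y_tolerance=3.0):
    """Turn a list of (text, x, y) text boxes into rows of cells, top to bottom."""
    columns = cluster([x for _, x, _ in boxes], x_tolerance)
    rows = cluster([y for _, _, y in boxes], y_tolerance)
    rows.reverse()  # PDF y grows upward, so the largest y is the first row
    table = [["" for _ in columns] for _ in rows]
    for text, x, y in boxes:
        r, c = nearest_index(rows, y), nearest_index(columns, x)
        # If two boxes land in the same cell, keep both, separated by a space.
        table[r][c] = (table[r][c] + " " + text).strip()
    return table


def table_to_csv(table):
    """Serialise the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(table)
    return buffer.getvalue()
```

Empty cells come out as empty strings, so a missing value does not shift the rest of its row into the wrong column. That shifting is what usually breaks plain-text extraction.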
I recently ran into this problem.
An alternative solution I found was to open a PDF document in Adobe and export it to XML. At least with my PDFs, it preserved the table information, and I was then able to work with the XML programmatically to generate tabular files such as Excel.
The other issue I ran into was that Adobe only lets you export one file at a time, and I had lots of files. Luckily, Adobe also has a merge function. I ended up merging all the files together, exporting them as one big XML file, and working with that file to generate what I needed.
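For the "work with the XML programmatically" step, a minimal Python sketch might look like the one below. The element names `Table`, `TR`, `TH` and `TD` are assumptions about what the export contains, so check your own exported file and pass different names if they differ. The sample document in the usage note is made up.

```python
import csv
import io
import xml.etree.ElementTree as ET


def tables_from_xml(xml_text, table_tag="Table", row_tag="TR", cell_tags=("TH", "TD")):
    """Return every table in the XML as a list of rows, each row a list of cell texts.

    The default tag names are assumptions about the exported XML.
    """
    root = ET.fromstring(xml_text)
    tables = []
    for table in root.iter(table_tag):
        rows = []
        for row in table.iter(row_tag):
            cells = [
                # Gather all text nested inside the cell and normalise whitespace.
                " ".join("".join(cell.itertext()).split())
                for cell in row
                if cell.tag in cell_tags
            ]
            rows.append(cells)
        tables.append(rows)
    return tables


def rows_to_csv(rows):
    """Serialise one table as CSV text, which Excel can open directly."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

With a made-up input such as `<Doc><Table><TR><TH>Item</TH><TH>Qty</TH></TR><TR><TD>Widget</TD><TD>3</TD></TR></Table></Doc>`, `tables_from_xml` returns a single table with a header row and one data row. Because the merged export is one big file, each table can be written to its own CSV by looping over the returned list.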