How to extract data from a PDF file while keeping track of its structure?
My objective is to extract the text and images from a PDF file while parsing its structure. The scope for parsing the structure is not exhaustive; I only need to be able to identify headings and paragraphs.
I have tried a few different things, but I did not get very far with any of them:
- Convert PDF to text. This does not work for me because I lose the images and the structure of the document.
- Convert PDF to HTML. I found a few tools that helped me with this, and the best one so far is pdftohtml. The tool is really good presentation-wise, but I haven't been able to parse the resulting HTML successfully.
- Convert PDF to XML. Same as above.
Does anyone have any suggestions on how to tackle this problem?
There is essentially no easy cut-and-paste solution, because PDF isn't really concerned with structure. There are many other answers on this site that will tell you things in much more detail, but this one should give you the main points:
If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?
If you want to do this in PDF itself (where you would have the most control over the process), you'll have to loop over all the text on each page and identify headings by looking at their text properties (fonts used, size relative to the other text on the page, etc.).
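As a rough illustration of the font-size heuristic, here is a minimal sketch in plain Java. It assumes the text has already been extracted by some PDF library into fragments that carry a font size; the `TextFragment` class, the `HeadingDetector` class and the 1.2 ratio are hypothetical and are not part of any library's API. The idea is simply to take the font size that covers the most characters as the body size and flag anything noticeably larger as a heading.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container for one extracted text fragment; a real PDF library
// would supply the text and font size through its own API.
final class TextFragment {
    final String text;
    final double fontSize;

    TextFragment(String text, double fontSize) {
        this.text = text;
        this.fontSize = fontSize;
    }
}

final class HeadingDetector {
    // Assumed threshold: a heading is at least 20% larger than the body text.
    static final double HEADING_RATIO = 1.2;

    // Treats the font size that covers the most characters as the body size.
    static double bodyFontSize(List<TextFragment> fragments) {
        Map<Double, Integer> charsPerSize = new HashMap<>();
        for (TextFragment f : fragments) {
            charsPerSize.merge(f.fontSize, f.text.length(), Integer::sum);
        }
        double best = 0.0;
        int bestCount = -1;
        for (Map.Entry<Double, Integer> e : charsPerSize.entrySet()) {
            if (e.getValue() > bestCount) {
                bestCount = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    // Returns the text of every fragment whose font is clearly larger than the body font.
    static List<String> headings(List<TextFragment> fragments) {
        double body = bodyFontSize(fragments);
        List<String> result = new ArrayList<>();
        for (TextFragment f : fragments) {
            if (f.fontSize >= body * HEADING_RATIO) {
                result.add(f.text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<TextFragment> page = Arrays.asList(
            new TextFragment("Introduction", 18.0),
            new TextFragment("PDF stores glyphs and positions, not paragraphs.", 11.0),
            new TextFragment("Tools have to guess the structure.", 11.0),
            new TextFragment("Approach", 18.0));
        System.out.println(headings(page));
    }
}
```

Real documents often mark headings with a bold font or a different typeface at the same size, so a practical version would probably compare font names as well.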
On top of that, you'll also have to identify paragraphs by looking at the positioning of text fragments, the white space on the page, and the closeness of certain letters, words and lines. PDF by itself doesn't even have a concept of a "word", let alone "lines" or "paragraphs".
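The white-space part of that can be sketched as follows, again in plain Java and under stated assumptions: the lines have already been assembled and sorted top to bottom, each line carries the y coordinate of its baseline in PDF user space (where y grows upward), and a baseline gap larger than 1.5 times the font size starts a new paragraph. `TextLine`, `ParagraphGrouper` and the 1.5 factor are hypothetical choices for illustration, not values taken from any tool.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical line of text with the y coordinate of its baseline.
// Assumes PDF user space, where y grows upward from the bottom of the page.
final class TextLine {
    final String text;
    final double baselineY;
    final double fontSize;

    TextLine(String text, double baselineY, double fontSize) {
        this.text = text;
        this.baselineY = baselineY;
        this.fontSize = fontSize;
    }
}

final class ParagraphGrouper {
    // Assumed threshold: a baseline gap above 1.5x the font size starts a new paragraph.
    static final double GAP_FACTOR = 1.5;

    // Expects lines in reading order (top to bottom) and joins them into paragraphs.
    static List<String> paragraphs(List<TextLine> lines) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        TextLine previous = null;
        for (TextLine line : lines) {
            if (previous != null) {
                double gap = previous.baselineY - line.baselineY;
                if (gap > GAP_FACTOR * line.fontSize) {
                    // Large vertical gap: close the current paragraph.
                    result.add(current.toString());
                    current = new StringBuilder();
                } else {
                    // Normal line spacing: the line continues the paragraph.
                    current.append(' ');
                }
            }
            current.append(line.text);
            previous = line;
        }
        if (current.length() > 0) {
            result.add(current.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<TextLine> lines = Arrays.asList(
            new TextLine("PDF has no notion", 700.0, 11.0),
            new TextLine("of a paragraph.", 686.0, 11.0),
            new TextLine("Tools infer it from spacing.", 650.0, 11.0));
        for (String paragraph : paragraphs(lines)) {
            System.out.println(paragraph);
        }
    }
}
```

A fixed factor is only a starting point; documents that separate paragraphs by first-line indentation rather than extra spacing would need the x coordinate as well.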
To complicate things even more, the order in which text is drawn on the page (and thus the order in which it appears in the PDF file itself) doesn't even have to be the proper reading order (or what we humans would consider to be the proper reading order).
Parsing a PDF for headings and their sub-content is really difficult (which doesn't mean it's impossible), because PDFs come in many different layouts. I recently came across a tool named GROBID that can help in this scenario. I know it's not perfect, but with proper training it can accomplish our goals.
GROBID is available as open source on GitHub.
https://github.com/kermitt2/grobid
You may use the following approach with iTextSharp or other open source libraries:
Or you may use a commercial tool such as ByteScout PDF Extractor SDK, which is capable of doing exactly this:
Disclaimer: I am affiliated with ByteScout
PDF files can be parsed with tabula-py or tabula-java.
I wrote a full tutorial on how to use tabula-py in this article. You can also use Tabula in a web browser, as long as you have Java installed.
Unless it is Marked Content, a PDF does not have a structure. You have to 'guess' it, which is what the various tools are doing. There is a good blog post explaining the issues at http://blog.idrsolutions.com/2010/09/the-easy-way-to-discover-if-a-pdf-file-contains-structured-content/
As mentioned in the previous answers, PDFs aren't very easy to parse. However, if you have some additional information about the text that you want to parse, you can pull it off.
If your headings are positioned at specific parts of the page, you can parse the PDF file and sort the parsed output by coordinates.
If you have prior knowledge of the spacing between headings and paragraphs, you could also leverage this information to parse the file.
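Sorting by coordinates could look something like the sketch below. It is plain Java and assumes the fragments have already been extracted with their positions in PDF user space (origin at the bottom left, y growing upward) and that the page has a single column. `PositionedText` and `ReadingOrder` are hypothetical names, and rounding y to the nearest point to decide that two fragments share a line is an assumed tolerance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical text fragment with its position in PDF user space
// (origin at the bottom left of the page, y growing upward).
final class PositionedText {
    final String text;
    final double x;
    final double y;

    PositionedText(String text, double x, double y) {
        this.text = text;
        this.x = x;
        this.y = y;
    }
}

final class ReadingOrder {
    // Sorts top to bottom, then left to right. Rounding y to the nearest point
    // is an assumed tolerance so fragments on the same line stay together.
    static List<PositionedText> sort(List<PositionedText> fragments) {
        List<PositionedText> sorted = new ArrayList<>(fragments);
        sorted.sort(
            Comparator.comparingLong((PositionedText f) -> -Math.round(f.y))
                      .thenComparingDouble(f -> f.x));
        return sorted;
    }

    public static void main(String[] args) {
        List<PositionedText> page = Arrays.asList(
            new PositionedText("world", 120.0, 700.2),
            new PositionedText("Footer", 50.0, 40.0),
            new PositionedText("Hello", 50.0, 699.8));
        for (PositionedText f : sort(page)) {
            System.out.println(f.text);
        }
    }
}
```

Multi-column pages would need the columns to be separated first, since a plain top-to-bottom sort interleaves lines from neighbouring columns.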
PDFBox is a PDF parsing tool that you can use to extract text and images, on top of which you can define your own custom parsing rules.
However, to parse PDFs you need some prior knowledge of the general format of the PDF file. You can check out the following blog post, Document parsing, for more information on document parsing.
Disclaimer: I was involved in writing the blog post.
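iText API:
// Opens the PDF with iText's PdfReader; the backslash is escaped because "\t" would otherwise be read as a tab.
PdfReader pr = new PdfReader("C:\\test.pdf");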
References:
PDFReader