How to extract data from a PDF file while keeping track of its structure?

Posted on 2024-07-23 02:32:46 · 338 words · 6 views · 0 comments

My objective is to extract the text and images from a PDF file while parsing its structure. The scope for parsing the structure is not exhaustive; I only need to be able to identify headings and paragraphs.

I have tried a few different things, but I did not get very far with any of them:

  • Convert PDF to text. It does not work for me as I lose images and the structure of the document.
  • Convert PDF to HTML. I found a few tools that helped me with this, and the best one so far is pdftohtml. The tool is really good presentation-wise, but I haven't been able to successfully parse the HTML.
  • Convert PDF to XML. Same as above.

Does anyone have any suggestions on how to tackle this problem?


Comments (7)

夜空下最亮的亮点 2024-07-30 02:32:46

There is essentially no easy cut-and-paste solution, because PDF isn't really very interested in structure. There are many other answers on this site that will tell you things in much more detail, but this one should give you the main points:

Related question: "If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?"

If you want to do this in PDF itself (where you would have the most control over the process), you'll have to loop over all the text on each page and identify headers by looking at their text properties (fonts used, size relative to the other text on the page, and so on).

On top of that, you'll also have to identify paragraphs by looking at the positioning of text fragments, the white space on the page, and the closeness of certain letters, words and lines. PDF by itself doesn't even have a concept of a "word", let alone "lines" or "paragraphs".

To complicate things even more, the way text is drawn on the page (and thus the order in which it appears in the PDF file itself) doesn't even have to be the proper reading order (or what we humans would consider to be the proper reading order).
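
To make the font-size idea concrete, here is a minimal sketch in Python. It assumes you have already pulled text fragments out of the PDF with some library; the Fragment class, its fields and the 1.2 ratio are hypothetical choices for illustration, not part of any PDF API. It treats the most common font size (weighted by character count) as body text and labels anything noticeably larger as a heading.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text as it might come out of a PDF library (hypothetical shape)."""
    text: str
    x: float
    y: float          # distance from the top of the page
    font_size: float


def body_font_size(fragments):
    """Return the font size that covers the most characters on the page."""
    weights = Counter()
    for frag in fragments:
        weights[frag.font_size] += len(frag.text)
    return weights.most_common(1)[0][0]


def label_fragments(fragments, ratio=1.2):
    """Label each fragment 'heading' or 'paragraph' by size relative to body text."""
    if not fragments:
        return []
    body = body_font_size(fragments)
    return [
        ("heading" if frag.font_size >= body * ratio else "paragraph", frag.text)
        for frag in fragments
    ]


sample = [
    Fragment("1. Introduction", 72, 100, 18.0),
    Fragment("PDF stores glyphs at positions, not paragraphs.", 72, 130, 11.0),
    Fragment("Structure has to be inferred from layout.", 72, 144, 11.0),
]
labels = label_fragments(sample)
```

A real document will usually need more signals than size alone (bold fonts, numbering, spacing), so treat the threshold as something to tune per document family.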

海之角 2024-07-30 02:32:46

Parsing a PDF for headers and their sub-contents is really very difficult (which doesn't mean it's impossible), as PDFs come in various formats. But I recently came across a tool named GROBID which can help in this scenario. I know it's not perfect, but if we provide proper training it can accomplish our goals.

GROBID is available as open source on GitHub.

https://github.com/kermitt2/grobid

画尸师 2024-07-30 02:32:46

You can use the following approach with iTextSharp or other open source libraries:

  • Read the PDF file with iTextSharp or a similar open source tool and collect all text objects into an array (or convert the PDF to HTML using a tool like pdftohtml and then parse the HTML)
  • Sort all text objects by their coordinates so that objects that are close on the page end up next to each other
  • Then iterate through the objects and check the distance between them to see whether 2 or more objects can be merged into one paragraph
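
The steps above can be sketched roughly as follows in Python. This is only an illustration under assumptions: it expects XML shaped like the output of `pdftohtml -xml` (page elements containing text elements with top, left and height attributes), it assumes a single-column layout, and the Line class and the 0.6 gap factor are hypothetical choices of mine.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    left: float
    top: float
    height: float


def merge_into_paragraphs(lines, max_gap_factor=0.6):
    """Sort lines top-to-bottom and merge those separated by a small vertical gap."""
    ordered = sorted(lines, key=lambda ln: (ln.top, ln.left))
    paragraphs, current = [], []
    for line in ordered:
        if current:
            prev = current[-1]
            gap = line.top - (prev.top + prev.height)
            # A gap larger than a fraction of the line height starts a new paragraph.
            if gap > max_gap_factor * prev.height:
                paragraphs.append(" ".join(ln.text for ln in current))
                current = []
        current.append(line)
    if current:
        paragraphs.append(" ".join(ln.text for ln in current))
    return paragraphs


def paragraphs_from_xml(xml_text):
    """Collect text objects page by page (assumed pdftohtml -xml layout) and merge them."""
    root = ET.fromstring(xml_text)
    result = []
    for page in root.iter("page"):
        lines = []
        for el in page.iter("text"):
            text = "".join(el.itertext()).strip()  # also picks up text inside <b>/<i>
            if text:
                lines.append(Line(text, float(el.get("left")),
                                  float(el.get("top")), float(el.get("height"))))
        result.extend(merge_into_paragraphs(lines))
    return result


sample_xml = """<pdf2xml>
<page number="1">
<text top="114" left="72" width="200" height="12" font="1">second line of it.</text>
<text top="100" left="72" width="200" height="12" font="1">First paragraph,</text>
<text top="150" left="72" width="200" height="12" font="1"><b>Second</b> paragraph.</text>
</page>
</pdf2xml>"""
paragraphs = paragraphs_from_xml(sample_xml)
```

Sorting per page matters because coordinates restart on every page. Multi-column layouts would need an extra step that splits the page into columns before sorting.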

Or you can use a commercial tool such as ByteScout PDF Extractor SDK, which is capable of doing exactly this:

  • extract text and images while analyzing the layout of the text
  • output XML or CSV in which text objects are merged or split into paragraphs inside a virtual layout grid
  • access objects via a special API that makes it possible to address each object by its "virtual" row and column index, regardless of how it is stored inside the original PDF.

Disclaimer: I am affiliated with ByteScout

萝莉病 2024-07-30 02:32:46

PDF files can be parsed with tabula-py or tabula-java.

I wrote a full tutorial on how to use tabula-py in this article. You can also use Tabula in a web browser, as long as you have Java installed.

吐个泡泡 2024-07-30 02:32:46

Unless it contains Marked Content, a PDF does not have a structure. You have to 'guess' it, which is what the various tools are doing. There is a good blog post explaining the issues at http://blog.idrsolutions.com/2010/09/the-easy-way-to-discover-if-a-pdf-file-contains-structured-content/
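
As a rough first check for such content, the sketch below scans the raw bytes of a file for the /StructTreeRoot entry and a /MarkInfo flag of "/Marked true", which the PDF specification uses in the document catalog of a Tagged PDF. This is a heuristic of my own, not a parser: the function name is made up, and a file whose catalog sits inside a compressed object stream will be reported as untagged even if it is tagged.

```python
import re


def looks_tagged(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the raw bytes mention a structure tree and '/Marked true'."""
    has_struct_tree = b"/StructTreeRoot" in pdf_bytes
    is_marked = re.search(rb"/Marked\s+true", pdf_bytes) is not None
    return has_struct_tree and is_marked


# Tiny hand-written fragments standing in for real files.
tagged_sample = (
    b"%PDF-1.4\n1 0 obj\n"
    b"<< /Type /Catalog /MarkInfo << /Marked true >> /StructTreeRoot 5 0 R >>\n"
    b"endobj\n"
)
untagged_sample = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"

tagged_result = looks_tagged(tagged_sample)
untagged_result = looks_tagged(untagged_sample)
```

For a real file you would read it in binary mode and pass the bytes in. A positive answer only tells you that tags exist, not that they are good enough to recover headings and paragraphs.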

别把无礼当个性 2024-07-30 02:32:46

As mentioned in the previous answers, PDFs aren't very easy to parse. However, if you have some additional information about the text that you want to parse, you can pull it off.

  1. If your headings are positioned at specific parts of the page, you can parse the PDF file and sort the parsed output by coordinates.

  2. If you have prior knowledge of the spacing between headings and paragraphs, you could also leverage this information to parse the file.

PDFBox is a PDF parsing tool that you can use to extract text and images, and on top of which you can define your own custom parsing rules.
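
To show what such custom rules might look like, here is a small Python sketch that combines the two hints above: position on the page and spacing before a line. It does not use PDFBox; it works on lines that some extraction tool has already produced, and the Line class and both thresholds are hypothetical values you would replace with what you know about your own documents.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    left: float
    top: float
    height: float


def classify_lines(lines, heading_left_max=80.0, heading_gap_min=20.0):
    """Label a line 'heading' if it starts near the left margin after a large gap."""
    ordered = sorted(lines, key=lambda ln: ln.top)
    result = []
    prev_bottom = 0.0  # treat the top of the page as the previous bottom edge
    for line in ordered:
        gap_above = line.top - prev_bottom
        is_heading = line.left <= heading_left_max and gap_above >= heading_gap_min
        result.append(("heading" if is_heading else "paragraph", line.text))
        prev_bottom = line.top + line.height
    return result


page = [
    Line("Overview", 72, 100, 14),
    Line("Body text starts here.", 90, 120, 11),
    Line("Details", 72, 160, 14),
    Line("More body text.", 90, 180, 11),
]
classified = classify_lines(page)
```

Rules like these only work when the layout is consistent across the files you process, which is why the prior knowledge mentioned above matters.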

However, to parse PDFs you need some prior knowledge of the general format of the PDF file. You can check out the blog post "Document parsing" for more information on document parsing.

Disclaimer: I was involved in writing the blog post.

我不吻晚风 2024-07-30 02:32:46

iText API:
PdfReader pr = new PdfReader("C:\\test.pdf");

References:
PDFReader
