Parsing PDF files (especially with tables) with PDFBox
I need to parse a PDF file which contains tabular data. I'm using PDFBox to extract the file text to parse the result (String) later. The problem is that the text extraction doesn't work as I expected for tabular data. For example, I have a file which contains a table like this (7 columns: the first two always have data, only one Complexity column has data, only one Financing column has data):
+------------------------------------------------------------------+
| AIH | Value | Complexity                     | Financing         |
|     |       | Medium | High | Not applicable | MAC/Other | FAE   |
+------------------------------------------------------------------+
| xyz | 12.43 | 12.43  |      |                | 12.43     |       |
+------------------------------------------------------------------+
| abc |  1.56 |        | 1.56 |                |           | 1.56  |
+------------------------------------------------------------------+
Then I use PDFBox:
PDDocument document = PDDocument.load(pathToFile);
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(document);
Those two lines of data would be extracted like this:
xyz 12.43 12.4312.43
abc 1.56 1.561.56
There are no white spaces between the last two numbers, but this is not the biggest problem. The problem is that I don't know what the last two numbers mean: Medium, High, Not applicable? MAC/Other, FAE? I don't have the relation between the numbers and their columns.
It is not required for me to use the PDFBox library, so a solution that uses another library is fine. What I want is to be able to parse the file and know what each parsed number means.
You will need to devise an algorithm to extract the data in a usable format. Regardless of which PDF library you use, you will need to do this. Characters and graphics are drawn by a series of stateful drawing operations, i.e. move to this position on the screen and draw the glyph for character 'c'.
I suggest that you extend org.apache.pdfbox.pdfviewer.PDFPageDrawer and override the strokePath method. From there you can intercept the drawing operations for horizontal and vertical line segments and use that information to determine the column and row positions of your table. Then it's a simple matter of setting up text regions and determining which numbers, letters, and characters are drawn in which region. Since you know the layout of the regions, you'll be able to tell which column the extracted text belongs to.
Also, the reason you may not get spaces between text that is visually separated is that, very often, the PDF does not draw a space character at all. Instead, the text matrix is updated and a 'move' drawing command is issued, so that the next character is drawn a "space width" apart from the last one.
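To make the region idea concrete, here is a minimal sketch of the lookup step only. It assumes the x-positions of the vertical lines and the y-positions of the horizontal lines have already been collected (for example from the intercepted line-drawing operations). The class RuledGrid, its method names, and any coordinates you feed it are hypothetical and not part of PDFBox.

```java
import java.util.Arrays;

// Hypothetical helper (not a PDFBox class): maps a point to a row/column
// index of a ruled table, given the positions of the ruling lines.
final class RuledGrid {
    private final double[] xRules; // sorted x-positions of vertical lines
    private final double[] yRules; // sorted y-positions of horizontal lines

    RuledGrid(double[] verticalLineXs, double[] horizontalLineYs) {
        xRules = verticalLineXs.clone();
        yRules = horizontalLineYs.clone();
        Arrays.sort(xRules);
        Arrays.sort(yRules);
    }

    // Index i of the interval [rules[i], rules[i + 1]) containing v,
    // or -1 if v lies outside the outermost lines.
    private static int intervalOf(double[] rules, double v) {
        if (rules.length < 2 || v < rules[0] || v >= rules[rules.length - 1]) {
            return -1;
        }
        int pos = Arrays.binarySearch(rules, v);
        // Exact hit on a line: v starts interval 'pos'. Otherwise
        // binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos : -pos - 2;
    }

    int columnOf(double x) { return intervalOf(xRules, x); }

    int rowOf(double y) { return intervalOf(yRules, y); }
}
```

With the seven-column table from the question, and assuming one vertical line per column boundary, columnOf would return 2, 3 or 4 for a Complexity value and 5 or 6 for a Financing value.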
Good luck.
I tried many tools for extracting tables from PDF files, but none of them worked for me.
So I implemented my own algorithm (named traprange) to parse tabular data in PDF files. Following are some sample PDF files and results:
Visit my project page at traprange.
You can extract text by area in PDFBox. See the ExtractByArea.java example file, in the pdfbox-examples artifact if you're using Maven. A snippet looks like
The problem is getting the coordinates in the first place. I've had success extending the normal PDFTextStripper, overriding processTextPosition(TextPosition text), printing out the coordinates of each character, and figuring out where in the document they are.
But there's a much simpler way, at least if you're on a Mac. Open the PDF in Preview, press ⌘I to show the Inspector, choose the Crop tab and make sure the units are in Points, then choose Rectangular selection from the Tools menu and select the area of interest. Once you select an area, the Inspector shows you its coordinates, which you can round and feed into the Rectangle constructor arguments. You just need to confirm where the origin is, using the first method.
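As a rough illustration of what extracting by area amounts to, the sketch below keeps a set of named rectangles and reports which one a character position falls into. It uses only java.awt.geom, not PDFBox. The class NamedRegions and its methods are hypothetical. The origin conversion assumes the usual situation where one tool measures y upward from the bottom-left of the page and the other measures it downward from the top-left, which is the thing you would confirm by printing a few character coordinates.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for region-based extraction: named rectangles plus a
// lookup that tells which region a character position belongs to.
final class NamedRegions {
    private final Map<String, Rectangle2D> regions =
            new LinkedHashMap<String, Rectangle2D>();

    void add(String name, double x, double y, double width, double height) {
        regions.put(name, new Rectangle2D.Double(x, y, width, height));
    }

    // Name of the first region containing the point, or null if none does.
    String regionAt(double x, double y) {
        for (Map.Entry<String, Rectangle2D> e : regions.entrySet()) {
            if (e.getValue().contains(x, y)) {
                return e.getKey();
            }
        }
        return null;
    }

    // For a box whose lower edge sits 'bottomLeftY' above the bottom of the
    // page, returns the distance of its upper edge below the top of the page.
    static double toTopLeftY(double pageHeight, double bottomLeftY, double boxHeight) {
        return pageHeight - bottomLeftY - boxHeight;
    }
}
```

For example, on a 792 point high page, a 20 point high selection whose lower edge is at y = 700 in bottom-left coordinates starts at y = 72 in top-left coordinates.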
It may be too late for my answer, but I think this is not that hard. You can extend the PDFTextStripper class and override the writePage() and processTextPosition(...) methods. In your case I assume that the column headers are always the same. That means that you know the x-coordinate of each column heading, and you can compare the x-coordinates of the numbers to those of the column headings. If they are close enough (you have to test to decide how close), then you can say that the number belongs to that column.
Another approach would be to intercept the "charactersByArticle" Vector after each page is written:
Knowing your columns, you can do your comparison of the x-coordinates to decide what column every number belongs to.
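The comparison itself could look something like the sketch below. It is plain Java and leaves out the PDFBox side entirely: the x-coordinates are assumed to have been read from the text positions beforehand, and the class ColumnMatcher, the header positions, and the tolerance are all hypothetical values you would replace with ones measured from your own file.

```java
// Hypothetical sketch: assigns a value to the column whose header
// x-coordinate is nearest, as long as it lies within a tolerance.
final class ColumnMatcher {
    private final String[] names;
    private final double[] headerXs;
    private final double tolerance;

    ColumnMatcher(String[] names, double[] headerXs, double tolerance) {
        if (names.length != headerXs.length) {
            throw new IllegalArgumentException("one x-coordinate per column name");
        }
        this.names = names.clone();
        this.headerXs = headerXs.clone();
        this.tolerance = tolerance;
    }

    // Name of the closest column header, or null if no header is within the
    // tolerance. On an exact tie the later column wins.
    String columnFor(double x) {
        String best = null;
        double bestDistance = tolerance;
        for (int i = 0; i < headerXs.length; i++) {
            double distance = Math.abs(x - headerXs[i]);
            if (distance <= bestDistance) {
                bestDistance = distance;
                best = names[i];
            }
        }
        return best;
    }
}
```

If the numbers are right-aligned while the headers are left-aligned, comparing the right edges (x plus width) instead of the left edges may match better.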
The reason you don't have any spaces between numbers is because you have to set the word separator string.
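In PDFBox the setting in question is PDFTextStripper.setWordSeparator(String). To show why a separator can go missing in the first place, here is a small sketch of gap-based joining. It is not PDFBox's actual logic: the class GapJoiner, the way fragments are represented, and the threshold are assumptions made for illustration, and the fragments are assumed to be sorted from left to right.

```java
// Hypothetical sketch (not PDFBox's implementation): rebuilds a line from
// positioned fragments, inserting a separator wherever the horizontal gap
// between two neighbouring fragments exceeds a threshold.
final class GapJoiner {
    // Fragment i has text texts[i], starts at xs[i] and is widths[i] wide.
    static String join(String[] texts, double[] xs, double[] widths,
                       double minGap, String separator) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < texts.length; i++) {
            if (i > 0) {
                double previousEnd = xs[i - 1] + widths[i - 1];
                if (xs[i] - previousEnd > minGap) {
                    line.append(separator);
                }
            }
            line.append(texts[i]);
        }
        return line.toString();
    }
}
```

With a threshold that is too large for the gap between the last two columns, the two numbers would come out glued together, much like the 1.561.56 in the question.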
I hope this is useful to you or to others who might be trying similar things.
There's PDFLayoutTextStripper, which was designed to keep the format of the data.
From the README:
I've had decent success with parsing text files generated by the pdftotext utility (sudo apt-get install poppler-utils).
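One way the parsing step might look, assuming the text was produced with pdftotext's -layout option so that each column keeps roughly the same character offset on every line (the answer does not say which options were used). The class FixedWidthRow is hypothetical, and the offsets would be read off the header line of your own file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: slices a layout-preserved text line into cells at
// fixed character offsets, so that empty cells stay in their own column.
final class FixedWidthRow {
    // Cuts 'line' at the given ascending start offsets. Each cell runs up to
    // the next offset (or the end of the line) and is trimmed.
    static List<String> cells(String line, int[] starts) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < starts.length; i++) {
            int from = Math.min(starts[i], line.length());
            int to = (i + 1 < starts.length)
                    ? Math.min(starts[i + 1], line.length())
                    : line.length();
            out.add(line.substring(from, to).trim());
        }
        return out;
    }
}
```

Unlike splitting on whitespace, this keeps an empty string for each empty cell, which is what lets you tell a Medium value from a High one.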
Try TabulaPDF (https://github.com/tabulapdf/tabula). It is a very good library for extracting table content from PDF files, and it works very much as expected.
Good luck. :)
Camelot and Excalibur
You may want to try Camelot, an open-source library for Python. If you are not inclined to write code, you can use Excalibur, a web interface built around Camelot. You "upload" the document to a localhost web server and "download" the result from that same server.
Here is an example of using this Python code:
The input is a PDF containing this table:
Sample table from the PDF-TREX set
No help is given to Camelot; it works on its own by looking at the relative alignment of pieces of text. The result is returned in a CSV file:
PDF table extracted from the sample by Camelot
"Rules" can be added to help Camelot identify where the ruling lines are in sophisticated tables:
Rule added in Excalibur. Source
GitHub:
The two projects are active.
Here is a comparison with other software (with tests based on actual documents): Tabula, pdfplumber, pdftables, pdf-table-extract.
You cannot do that automatically, as pdf is not semantically structured.
Book versus document
PDF "documents" are unstructured from a semantic standpoint (much like a Notepad file). A PDF document gives instructions on where to print a text fragment, unrelated to the other fragments of the same section. There is no separation between the content (what to print, and whether it is a fragment of a title, a table, or a footnote) and the visual representation (font, location, etc.). PDF is derived from PostScript, which describes a Hello world! page this way:
(Wikipedia).
One can imagine what a table looks like with the same instructions.
We could say HTML is not clearer, but there is a big difference: HTML describes the content semantically (title, paragraph, list, table header, table cell, ...) and associates CSS with it to produce a visual form, so the content is fully accessible. In this sense, HTML is a simplified descendant of SGML, which imposes constraints to allow data processing:
exactly the opposite of PostScript/PDF. SGML is used in publishing. PDF doesn't embed this semantic structure; it carries only the CSS equivalent, associated with plain character strings that may not be complete words or sentences. PDF is used for closed documents and now for so-called workflow management.
After experiencing the uncertainty and difficulty of trying to extract data from PDF, it's clear that PDF is not at all a solution for preserving a document's content for the future (even though Adobe has obtained a PDF standard from its peers).
What is actually preserved well is the printed representation, as PDF was fully dedicated to this aspect when it was created. PDFs are nearly as dead as printed books.
When reusing the content matters, one must again rely on manually re-entering the data, as from a printed book (possibly trying some OCR on it). This is more and more true, as many PDFs even prevent copy-paste, introduce multiple spaces between words, or produce an unordered gibberish of characters when some "optimization" is done for web use.
When the content of the document, not its printed representation, is what's valuable, PDF is not the right format. Even Adobe is unable to perfectly recreate the source of a document from its PDF rendering.
So open data should never be released in PDF format: this limits its use to reading and printing (when allowed), and makes reuse harder or impossible.
Extracting data from PDF is bound to be fraught with problems. Are the documents created through some kind of automatic process? If so, you might consider converting the PDFs to uncompressed PostScript (try pdf2ps) and seeing if the PostScript contains some sort of regular pattern which you can exploit.
I had the same problem reading a PDF file whose data is in tabular format. After a regular parse using PDFBox, each row was extracted with a comma as the separator... losing the columnar position.
To resolve this I used PDFTextStripperByArea and, using coordinates, extracted the data column by column for each row. This assumes that you have a fixed-format PDF.
Then row 2 and so on...
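For a fixed-format file, the per-cell rectangles used in this kind of approach can be generated rather than typed in by hand. A minimal sketch, assuming you already know the x-coordinates of the column edges and the y-coordinates of the row edges. The class CellGrid is hypothetical, and handing each rectangle to PDFTextStripperByArea (via its addRegion method) is left out.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: builds one rectangle per table cell from the known
// column and row edges of a fixed-layout table.
final class CellGrid {
    // Cells are returned row by row, left to right. Both edge arrays are
    // expected in ascending order.
    static List<Rectangle2D> cells(double[] columnEdges, double[] rowEdges) {
        List<Rectangle2D> out = new ArrayList<Rectangle2D>();
        for (int r = 0; r + 1 < rowEdges.length; r++) {
            for (int c = 0; c + 1 < columnEdges.length; c++) {
                out.add(new Rectangle2D.Double(
                        columnEdges[c], rowEdges[r],
                        columnEdges[c + 1] - columnEdges[c],
                        rowEdges[r + 1] - rowEdges[r]));
            }
        }
        return out;
    }
}
```

The cell at row r and column c ends up at index r * (columnEdges.length - 1) + c, which makes it easy to name the regions after their row and column.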
You can use PDFBox's PDFTextStripperByArea class to extract text from a specific region of a document. You can build on this by identifying the region of each cell of the table. This isn't provided out of the box, but the example DrawPrintTextLocations class demonstrates how you can parse the bounding boxes of individual characters in a document (it would be great to parse the bounding boxes of strings or paragraphs, but I haven't seen support in PDFBox for this - see this question). You can use this approach to group up all touching bounding boxes to identify the distinct cells of a table. One way to do this is to maintain a set boxes of Rectangle2D regions and then, for each parsed character, find the character's bounding box as in DrawPrintTextLocations.writeString(String string, List<TextPosition> textPositions) and merge it with the existing contents.
You can then pass these regions to PDFTextStripperByArea.
You can also go one step further and separate out the horizontal and vertical components of these regions, and so infer the regions of all the table's cells, regardless of whether they hold any content.
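The merging step described above might be sketched as follows. This is not the code from the gist: the class BoxMerger and its methods are hypothetical, it works on plain java.awt.geom rectangles, and it treats boxes that merely share an edge as touching.

```java
import java.awt.geom.Rectangle2D;
import java.util.List;

// Hypothetical sketch: keeps a list of disjoint regions and merges every new
// bounding box with the regions it touches.
final class BoxMerger {
    // True if the two rectangles overlap or share an edge or corner.
    static boolean touches(Rectangle2D a, Rectangle2D b) {
        return a.getMinX() <= b.getMaxX() && b.getMinX() <= a.getMaxX()
                && a.getMinY() <= b.getMaxY() && b.getMinY() <= a.getMaxY();
    }

    // Adds 'box' to 'boxes', first absorbing every existing region it
    // touches. The scan restarts after each merge because the grown region
    // may now touch regions it did not touch before.
    static void addMerging(List<Rectangle2D> boxes, Rectangle2D box) {
        Rectangle2D merged = box;
        boolean absorbed = true;
        while (absorbed) {
            absorbed = false;
            for (int i = 0; i < boxes.size(); i++) {
                if (touches(boxes.get(i), merged)) {
                    merged = merged.createUnion(boxes.remove(i));
                    absorbed = true;
                    break;
                }
            }
        }
        boxes.add(merged);
    }
}
```

Real character boxes usually have small gaps between them, so in practice you would probably grow each box by a small tolerance before testing whether it touches another one.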
I have had cause to perform these steps, and eventually wrote my own PDFTableStripper class using PDFBox. I've shared my code as a gist on GitHub. The main method gives an example of how the class can be used:
How about printing to image and doing OCR on that?
Sounds terribly ineffective, but it's practically the very purpose of PDF to make text inaccessible, you gotta do what you gotta do.
http://swftools.org/: these guys have a pdf2swf component, and they are also able to show tables.
They also provide the source, so you could check it out.
This works fine with pdfbox 2.0.6 if the PDF file contains only rectangular tables. It won't work with any other kind of table.
For anyone wanting to do the same thing as the OP (as I do): after days of research, Amazon Textract is the best option I found (if your volume is low, the free tier might be enough).
Consider using PDFTableStripper.class.
The class is available on GitHub:
https://gist.github.com/beldaz/8ed6e7473bd228fcee8d4a3e4525be11#file-pdftablestripper-java-L1
I'm not familiar with PDFBox, but you could try looking at itext. Even though the homepage says PDF generation, you can also do PDF manipulation and extraction. Have a look and see if it fits your use case.
To read the content of a table from a PDF file, you only have to convert the PDF file into a text file using any API (I used PdfTextExtractor.getTextFromPage() from iText) and then read that text file with your Java program. Once you have read it, the major task is done; you then have to filter out the data you need. You can do this by repeatedly using the split method of the String class until you find the records you are interested in. Here is my code, with which I extracted part of the records from a PDF file and wrote them to a .CSV file. The URL of the PDF file is http://www.cea.nic.in/reports/monthly/generation_rep/actual/jan13/opm_02.pdf
Code:
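As a rough illustration of the split-and-filter step described in this answer (this is not the poster's code, and the iText extraction call is left out), the sketch below takes text that has already been extracted, splits each line on whitespace, and keeps only the lines with the expected number of fields. The class TextToCsv and the field count are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: turns already-extracted text into CSV lines by
// splitting on whitespace and filtering on the number of fields.
final class TextToCsv {
    // Returns one comma-joined line per input line that has exactly
    // 'expectedFields' whitespace-separated fields. Other lines are skipped.
    static List<String> toCsv(String text, int expectedFields) {
        List<String> rows = new ArrayList<String>();
        for (String line : text.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] fields = trimmed.split("\\s+");
            if (fields.length == expectedFields) {
                rows.add(String.join(",", fields));
            }
        }
        return rows;
    }
}
```

Note that splitting on whitespace drops empty cells, so this only works for tables where every cell is filled. It would not tell the Complexity and Financing columns apart in the table from the question. Fields that contain commas would also need quoting.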