Parsing PDF files (especially with tables) with PDFBox

Posted 2024-09-08 19:36:24

I need to parse a PDF file which contains tabular data. I'm using PDFBox to extract the file text to parse the result (String) later. The problem is that the text extraction doesn't work as I expected for tabular data. For example, I have a file which contains a table like this (7 columns: the first two always have data, only one Complexity column has data, only one Financing column has data):

+----------------------------------------------------------------+
| AIH | Value | Complexity                     | Financing       |
|     |       | Medium | High | Not applicable | MAC/Other | FAE |
+----------------------------------------------------------------+
| xyz | 12.43 | 12.43  |      |                | 12.43     |     |
+----------------------------------------------------------------+
| abc | 1.56  |        | 1.56 |                |           | 1.56|
+----------------------------------------------------------------+

Then I use PDFBox:

PDDocument document = PDDocument.load(pathToFile);
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(document);

Those two lines of data would be extracted like this:

xyz 12.43 12.4312.43
abc 1.56 1.561.56

There are no white spaces between the last two numbers, but this is not the biggest problem. The problem is that I don't know what the last two numbers mean: Medium, High, Not applicable? MAC/Other, FAE? I don't have the relation between the numbers and their columns.

It is not required for me to use the PDFBox library, so a solution that uses another library is fine. What I want is to be able to parse the file and know what each parsed number means.

Answer by 梦醒时光, 2024-09-15 19:36:24

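You will need to devise an algorithm to extract the data in a usable format, and that is true whichever PDF library you use. Characters and graphics are drawn by a series of stateful drawing operations, for example: move to this position on the page and draw the glyph for the character 'c'.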

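I suggest that you extend org.apache.pdfbox.pdfviewer.PDFPageDrawer and override the strokePath method. From there you can intercept the drawing operations for horizontal and vertical line segments and use that information to determine the column and row positions of your table. After that it is a simple matter of setting up text regions and determining which numbers, letters, and other characters are drawn in which region. Since you know the layout of the regions, you can tell which column each piece of extracted text belongs to.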
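The region-matching step can be sketched without any PDF library. The example below assumes you have already collected the x coordinates of the table's vertical rules (for instance from intercepted line-drawing operations) and the starting x position of each text fragment; the `ColumnAssigner` and `Fragment` names and all coordinates are made up for illustration and are not part of PDFBox.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: maps text fragments to table columns by x position. */
class ColumnAssigner {
    /** A text fragment and the x coordinate where it starts (page units). */
    static final class Fragment {
        final String text;
        final double x;
        Fragment(String text, double x) { this.text = text; this.x = x; }
    }

    private final double[] rules;   // sorted x coordinates of the vertical rules
    private final String[] headers; // left-to-right, one per gap between adjacent rules

    ColumnAssigner(double[] verticalRuleXs, String[] headers) {
        if (headers.length != verticalRuleXs.length - 1) {
            throw new IllegalArgumentException("need exactly one header per column");
        }
        this.rules = verticalRuleXs.clone();
        Arrays.sort(this.rules);
        this.headers = headers.clone();
    }

    /** Returns the header of the column containing x, or null if x is outside the table. */
    String columnOf(double x) {
        for (int i = 0; i < headers.length; i++) {
            if (x >= rules[i] && x < rules[i + 1]) {
                return headers[i];
            }
        }
        return null;
    }

    /** Labels every fragment of one row as "header=text". */
    List<String> label(List<Fragment> row) {
        List<String> out = new ArrayList<>();
        for (Fragment f : row) {
            out.add(columnOf(f.x) + "=" + f.text);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up rule positions for the seven columns of the example table.
        double[] rules = {0, 40, 90, 150, 200, 300, 380, 420};
        String[] headers = {"AIH", "Value", "Medium", "High",
                            "Not applicable", "MAC/Other", "FAE"};
        ColumnAssigner table = new ColumnAssigner(rules, headers);
        List<Fragment> row = Arrays.asList(
            new Fragment("abc", 5), new Fragment("1.56", 45),
            new Fragment("1.56", 155), new Fragment("1.56", 385));
        // prints [AIH=abc, Value=1.56, High=1.56, FAE=1.56]
        System.out.println(table.label(row));
    }
}
```

A real table would also need row boundaries from the horizontal rules, handled the same way on the y axis.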

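Also, the reason you may see no spaces between pieces of text that are visually separated is that, very often, the PDF does not draw a space character at all. Instead, the text matrix is updated and a "move" drawing command is issued, so the next character is drawn one "space width" away from the previous one.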
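One common way to recover the missing spaces is to compare the horizontal gap between consecutive text chunks with the width of a space in the current font. The sketch below is a plain-Java illustration under that assumption; `GapSpacer`, `Chunk`, the `tolerance` factor, and the coordinates are hypothetical, and the positions and widths would have to come from whatever PDF library you use.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: re-inserts spaces based on gaps between drawn chunks. */
class GapSpacer {
    /** One drawn chunk of text: its string, start x, and width (same units). */
    static final class Chunk {
        final String text;
        final double x;
        final double width;
        Chunk(String text, double x, double width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins chunks that are already sorted left to right. A space is inserted
     * when the gap to the previous chunk exceeds spaceWidth * tolerance.
     */
    static String join(List<Chunk> chunks, double spaceWidth, double tolerance) {
        StringBuilder sb = new StringBuilder();
        Chunk prev = null;
        for (Chunk c : chunks) {
            if (prev != null) {
                double gap = c.x - (prev.x + prev.width);
                if (gap > spaceWidth * tolerance) {
                    sb.append(' ');
                }
            }
            sb.append(c.text);
            prev = c;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up positions: the last two values sit in different columns.
        List<Chunk> row = Arrays.asList(
            new Chunk("xyz", 5, 15), new Chunk("12.43", 45, 25),
            new Chunk("12.43", 95, 25), new Chunk("12.43", 305, 25));
        // prints xyz 12.43 12.43 12.43
        System.out.println(join(row, 3.0, 0.5));
    }
}
```

The gap test only restores separation between values; deciding which column each value belongs to still requires the column boundaries.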

Good luck.
