Preserving "long" spaces in PDFBox text extraction
I am using PDFBox to extract text from PDF.
The PDF has a simple tabular structure, and the columns are spaced very widely from each other.
This works really well, except that every kind of horizontal space gets converted into a single space character, so I can no longer tell the columns apart (a space between words within a column looks just like the space between columns).
I appreciate that a general solution is very hard, but in this case the columns are really far apart so that having a simple differentiation between "long spaces" and "space between words" would be enough.
Is there a way to tell PDFBox to turn horizontal whitespace of more than x inches into something other than a single space? A proportional approach (x inches become y spaces) would also work.
The pdftotext C library/tool has a '-layout' switch that tries to preserve the layout. Basically, if I can emulate that with PDFBox, that would be perfect.
Comments (2)
There does not seem to be a setting for this, but I was able to modify the source of PDFTextStripper to output a column separator (|) whenever a "long" space was encountered. In the code that builds the output line, it is possible to look at the x positions of the current and previous letters and, if the gap between them is large enough, do something special. PDFTextStripper has lots of protected methods, but it turned out not to be all that extensible. I ended up having to copy the whole class in order to change a private method.
Looking at the code in there, I call myself lucky that with the particular PDF, this simple approach was successful. A more general solution seems very tricky.
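The gap check described above can be sketched roughly as follows. This is not the modified PDFTextStripper code itself, just a minimal stand-alone illustration. `Glyph`, `joinLine`, and both thresholds are hypothetical names made up for this example; in PDFBox the coordinates would presumably come from `TextPosition` objects (for example `getXDirAdj()` and `getWidthDirAdj()`), and the units are assumed to be points (72 per inch).

```java
import java.util.ArrayList;
import java.util.List;

public class LongGapSeparator {

    /** Hypothetical stand-in for one positioned glyph on a text line. */
    static final class Glyph {
        final String text;
        final float x;      // left edge, in points
        final float width;  // advance width, in points

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins the glyphs of one line (already sorted left to right).
     * A gap of at least columnGap becomes " | ", a gap of at least
     * wordGap becomes a single space, and anything smaller is ignored.
     */
    static String joinLine(List<Glyph> glyphs, float wordGap, float columnGap) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                // Distance from the end of the previous glyph to the start of this one.
                float gap = g.x - (prev.x + prev.width);
                if (gap >= columnGap) {
                    out.append(" | ");
                } else if (gap >= wordGap) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = new ArrayList<>();
        line.add(new Glyph("a", 10f, 5f));
        line.add(new Glyph("b", 15f, 5f));
        line.add(new Glyph("c", 23f, 5f));   // 3 pt after "b": a word space
        line.add(new Glyph("d", 28f, 5f));
        line.add(new Glyph("e", 105f, 5f));  // 72 pt after "d": a column gap
        line.add(new Glyph("f", 110f, 5f));
        // Treat gaps of 2 pt or more as word spaces and 36 pt (half an inch) or more as columns.
        System.out.println(joinLine(line, 2f, 36f));
    }
}
```

Both thresholds would need tuning for the particular PDF; half an inch only works here because the columns are so far apart.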
PDF text extraction is difficult.
If the text were output as one big string separated by spaces, and you were using a fixed-width font such as Courier, then you could theoretically calculate the number of spaces between items of text, because every character has the same width. If the font is proportional, such as Arial, the calculation is harder.
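For the fixed-width case, the arithmetic might look like the sketch below. It assumes the horizontal gap between two text items and the font size are already known in points; `spacesForGap` and `pad` are hypothetical helper names. The factor 0.6 reflects Courier's advance width of 600 units per 1000-unit em; another monospaced font could use a different value.

```java
public class MonospaceGap {

    /** Courier's advance width as a fraction of the font size. */
    static final double COURIER_ADVANCE_EM = 0.6;

    /** Number of space characters that would fill a horizontal gap of gapPoints. */
    static int spacesForGap(double gapPoints, double fontSizePoints) {
        double charWidth = COURIER_ADVANCE_EM * fontSizePoints;
        return (int) Math.max(0, Math.round(gapPoints / charWidth));
    }

    /** Joins two text items with as many spaces as the gap between them implies. */
    static String pad(String left, String right, double gapPoints, double fontSizePoints) {
        StringBuilder sb = new StringBuilder(left);
        int n = spacesForGap(gapPoints, fontSizePoints);
        for (int i = 0; i < n; i++) {
            sb.append(' ');
        }
        return sb.append(right).toString();
    }

    public static void main(String[] args) {
        // 10 pt Courier: each character is 6 pt wide, so a 1 inch (72 pt) gap is 12 spaces.
        System.out.println(spacesForGap(72.0, 10.0));
        System.out.println("[" + pad("Name", "Qty", 36.0, 10.0) + "]");
    }
}
```

This is the proportional "x inches become y spaces" idea from the question; with a proportional font there is no single character width to divide by, so some average or nominal space width would have to be chosen instead.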
In reality, most PDFs are generated by placing each piece of text individually and directly at its position. Technically, therefore, there is no space character, or any other character, between the columns. The text is simply placed at an absolute position on the page.
To perform data extraction on PDF documents, you have to do a bit more work to find and match the column data, by using the text positions as you suggested, by making some assumptions, and with a little bit of luck.
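One way to do that matching, sketched under fairly strong assumptions: each cell arrives as a single text run with a known start x, and the columns are left-aligned, so runs belonging to the same column start at nearly the same x on every row. `ColumnFinder`, `columnEdges`, `columnOf`, and the tolerance are hypothetical names and values chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnFinder {

    /**
     * Groups start positions that lie within tolerance of their neighbour
     * and returns the leftmost x of each group as a column edge.
     */
    static List<Double> columnEdges(double[] startXs, double tolerance) {
        double[] xs = startXs.clone();
        Arrays.sort(xs);
        List<Double> edges = new ArrayList<>();
        double last = 0.0;
        for (double x : xs) {
            // A jump larger than the tolerance starts a new column.
            if (edges.isEmpty() || x - last > tolerance) {
                edges.add(x);
            }
            last = x;
        }
        return edges;
    }

    /** Index of the rightmost column whose edge is at or left of x (within tolerance). */
    static int columnOf(double x, List<Double> edges, double tolerance) {
        int col = 0;
        for (int i = 0; i < edges.size(); i++) {
            if (x + tolerance >= edges.get(i)) {
                col = i;
            }
        }
        return col;
    }

    public static void main(String[] args) {
        // Start x of every text run on the page, in points, in no particular order.
        double[] starts = {72.0, 72.4, 250.1, 250.0, 430.0, 72.2};
        List<Double> edges = columnEdges(starts, 5.0);
        System.out.println(edges);
        System.out.println(columnOf(250.1, edges, 5.0));
    }
}
```

Right-aligned or centred columns would break the left-edge assumption, which is part of why a general solution is so hard.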