Parsing large PDF files with Apache Lucene
I am trying to find the best way to search and parse a set of large PDF files. I am currently using PDFBox to convert the PDF files to text files, and then using Lucene to index those text files and search them. I am running into some problems with this approach. (Note that I am using both technologies at a very basic level, just to see what they can do.)
Consider the following line from my PDF file, which gives the grand total of all the columns. Each column contains a pair of values, and the totals are displayed as follows.
Grand Total $3,148.06 $484.80 $13.07 $8.90 $0.00 $69.90 $0.00 $0.00
$10.00 $5.15 $25.60 $0.00 $2.69 $0.00 $0.00 $0.00 $3,768.17
When I convert the PDF file to a text file using TextStripper from PDFBox, the line above comes out as the following text.
58.20$3,148.06 $484.80 $13.07 $0.00 $0.00 $0.00Grand Total $8.90 $69.90$10.00 $5.15 $25.60 $0.00 $2.69 $0.00 $0.00 $0.00 $3,768.17
As the extracted text shows, the values are scattered around the Grand Total label. Because the layout of the PDF file is not preserved in the text file, it is difficult to retrieve the grand total information.
I would therefore like to know whether there is a way to convert a PDF file to a text file that keeps the indentation and formatting of the PDF. I would also like to know whether Lucene is a good fit for my objective, or whether there is a simpler and faster way to retrieve information from a set of large PDF files.
Comments (1)
You can try Tika. (Generally when people extract data from PDFs into Lucene, they use Tika.)
Is there an easier way? Solr has strong integration with Tika, which should make it quite easy to index PDF documents. (Solr is a wrapper around Lucene.)
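Whichever extractor is used, the scrambled line from the question can still be mined for its dollar amounts. The following is only a minimal sketch using the Java standard library, not part of PDFBox, Tika, or Lucene. The class name `GrandTotalAmounts` and the method `extractAmounts` are hypothetical, and the pattern assumes amounts always look like `$1,234.56` (dollar sign, comma-grouped thousands, two decimals). It does not restore the original column order; it only collects the values in the order they appear in the extracted text.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects every "$1,234.56"-style amount from one line of extracted text.
class GrandTotalAmounts {
    // Assumption: a dollar sign, 1-3 digits, optional ",ddd" groups, then exactly two decimals.
    private static final Pattern AMOUNT =
            Pattern.compile("\\$(\\d{1,3}(?:,\\d{3})*\\.\\d{2})");

    static List<BigDecimal> extractAmounts(String line) {
        List<BigDecimal> amounts = new ArrayList<>();
        Matcher m = AMOUNT.matcher(line);
        while (m.find()) {
            // Drop the thousands separators before parsing the number.
            amounts.add(new BigDecimal(m.group(1).replace(",", "")));
        }
        return amounts;
    }

    public static void main(String[] args) {
        // The extracted line quoted in the question.
        String extracted = "58.20$3,148.06 $484.80 $13.07 $0.00 $0.00 $0.00Grand Total "
                + "$8.90 $69.90$10.00 $5.15 $25.60 $0.00 $2.69 $0.00 $0.00 $0.00 $3,768.17";
        List<BigDecimal> amounts = extractAmounts(extracted);
        System.out.println(amounts.size() + " amounts found: " + amounts);
    }
}
```

The leading `58.20` in the extracted line carries no dollar sign, so this pattern skips it. Mapping each value back to its column would still need position-aware extraction from the PDF itself.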