Parsing large PDF files with Apache Lucene

Posted on 2024-12-17 01:45:04

I am trying to find the best way to search/parse a set of large PDF files. I am currently using PDFBox to convert my PDF files to text files, and then using Lucene to index these text files and search them for information. I am running into some problems with this approach. (Note that I am using both technologies at a very basic level, just to see what they can do.)

Consider the following lines from my PDF file, which give the grand total of all the columns. Each column contains a pair of values, and the totals are displayed as follows.

    Grand Total  $3,148.06 $484.80 $13.07 $8.90 $0.00 $69.90 $0.00 $0.00
                 $10.00    $5.15   $25.60 $0.00 $2.69 $0.00  $0.00 $0.00 $3,768.17

When I convert my PDF file to a text file using PDFTextStripper from PDFBox, the above lines from the PDF file are converted to the following text in the text file.

    58.20$3,148.06 $484.80 $13.07 $0.00 $0.00 $0.00Grand Total $8.90 $69.90$10.00 $5.15 $25.60 $0.00 $2.69 $0.00 $0.00 $0.00 $3,768.17

As the extracted text shows, the data is scattered around the Grand Total label. Because the indentation from the PDF file is not preserved in the text file, it becomes difficult to retrieve the grand total information.

I would therefore like to know whether there is a way to convert a PDF file to a text file such that the text file keeps the indentation/format of the PDF file. I would also like to know whether Lucene is a good choice for my objective, or whether there is a simpler and faster way to retrieve information from a set of large PDF files.
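To make the problem concrete, here is a minimal sketch of one possible workaround: pulling the dollar amounts out of the extracted line by pattern rather than by position. It uses only the Java standard library. The class name `GrandTotalAmounts` and the method `extractAmounts` are made up for illustration and are not part of PDFBox or Lucene, and the pattern assumes every amount looks like the ones in my sample (a dollar sign, optional thousands separators, two decimals).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not a PDFBox or Lucene API.
class GrandTotalAmounts {

    // Assumed amount format: "$", 1-3 digits, optional ",ddd" groups, ".dd".
    private static final Pattern AMOUNT =
            Pattern.compile("\\$\\d{1,3}(?:,\\d{3})*\\.\\d{2}");

    /** Returns every dollar amount found in the line, in order of appearance. */
    static List<String> extractAmounts(String line) {
        List<String> amounts = new ArrayList<>();
        Matcher m = AMOUNT.matcher(line);
        while (m.find()) {
            amounts.add(m.group());
        }
        return amounts;
    }

    public static void main(String[] args) {
        // The extracted text line from the question, copied verbatim.
        String extracted = "58.20$3,148.06 $484.80 $13.07 $0.00 $0.00 $0.00"
                + "Grand Total $8.90 $69.90$10.00 $5.15 $25.60 $0.00 $2.69"
                + " $0.00 $0.00 $0.00 $3,768.17";
        List<String> amounts = extractAmounts(extracted);
        System.out.println(amounts.size() + " amounts: " + amounts);
    }
}
```

On the sample line this finds 17 amounts, the same number as in the two PDF rows (8 + 9), and the stray `58.20` is skipped because it has no dollar sign. It does not restore the layout, though: the values from the first row come out in a different order than in the PDF, so they still cannot be mapped back to their columns by position alone.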
