Fastest PDF->text library for a .NET project

Posted 2024-09-10 14:43:21

I'm trying to create an application that will basically be a catalogue of my PDF collection. We are talking about 15-20 GB containing tens of thousands of PDFs. I am also planning to include a full-text search mechanism. I will be using Lucene.NET for search (actually, NHibernate.Search), and a library for PDF->text conversion. Which would be the best choice? I was considering these:

  • PDFBox
  • pdftotext (from xpdf) via a C# wrapper
  • iTextSharp

Edit: Another good option seems to be using IFilters. How well (in speed and quality) would they (Foxit/Adobe) perform in comparison to these libraries?

Commercial libraries are probably out of the question, as it is my private project and I don't really have a budget for commercial solutions - although PDFTextStream looks really nice.

From what I've read, pdftotext is a lot faster than PDFBox. How well does iTextSharp perform in comparison to pdftotext? Or maybe someone can recommend other good solutions?
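For anyone wanting to measure this on their own files, here is a minimal sketch of the "pdftotext via a wrapper" option plus a tiny timing helper. It is written in Python only to keep it short; the same pattern would apply to a C# wrapper that starts the process. It assumes a `pdftotext` executable (from xpdf or poppler) is on the PATH. The function names `build_pdftotext_command`, `extract_text` and `time_extractor` are made up for this illustration.

```python
import subprocess
import time


def build_pdftotext_command(pdf_path, executable="pdftotext"):
    """Build the argument list for one pdftotext run.

    -q suppresses messages, -enc UTF-8 sets the output encoding,
    and "-" as the output file sends the text to stdout.
    """
    return [executable, "-q", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path, executable="pdftotext", timeout_seconds=60):
    """Run pdftotext on one file and return the extracted text.

    Assumes the executable is installed; raises CalledProcessError
    if pdftotext exits with a non-zero status.
    """
    result = subprocess.run(
        build_pdftotext_command(pdf_path, executable),
        capture_output=True,
        check=True,
        timeout=timeout_seconds,
    )
    return result.stdout.decode("utf-8", errors="replace")


def time_extractor(extract, pdf_paths):
    """Time any extractor function over a list of paths.

    Returns (elapsed_seconds, total_characters) so that speed and
    the amount of text recovered can be compared side by side.
    """
    start = time.perf_counter()
    total_chars = sum(len(extract(path)) for path in pdf_paths)
    return time.perf_counter() - start, total_chars
```

Running `time_extractor(extract_text, sample_paths)` against a few hundred representative PDFs, and doing the same with a function that calls another library, would give a rough comparison for a specific collection. Character counts only hint at quality, so spot-checking the output is still needed.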

Comments (3)

清醇 2024-09-17 14:43:21

If it is for a private project, is this going to be an ongoing conversion process? E.g., after you've converted the 15-20 GB, are you still going to be converting?

The reason I ask is that I'm trying to work out whether speed is your primary issue. If it were me, for example, converting a library of books, my primary concern would be the quality of the conversion, not the speed. I could always leave the conversion running overnight or over a weekend if necessary!

终弃我 2024-09-17 14:43:21

The desktop version of Foxit's PDF IFilter is free

http://www.foxitsoftware.com/pdf/ifilter/

It will automatically do the indexing and searching, but perhaps its index is available for you to use as well. If you are planning to use it in an application you sell or distribute, then I guess it won't be a good choice, but if it's just for yourself, then it might work.

The Foxit code is at the core of my company's PDF Reader/Text Extraction library, which wouldn't be appropriate for your project, but I can vouch for the speed and quality of the results of the underlying Foxit engine.

2024-09-17 14:43:21

I guess using any library is fine, but do you want to scan all 20 GB of files at search time?

For full-text search, the best approach is to create a database, something like SQLite or any local database on the client machine, read each PDF, convert it to plain text, and store that text in the database when the PDF is first added.

Your database can simply be as follows:

Table: PDFFiles
PDFFileID
PDFFilePath
PDFTitle
PDFAuthor
PDFKeywords
PDFFullText....

You can then search this table when you need to. This way your search will be extremely fast, independent of the type of PDF, and the conversion from PDF to database is needed only when a PDF is added to your collection or modified.
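A minimal sketch of that idea, using Python's built-in `sqlite3` module purely for illustration (the same schema would work from .NET). The table and column names come from the list above, except `ModifiedTime`, which is a hypothetical extra column added here to detect changed files. The helper names are also made up, and the `LIKE` scan is a placeholder: a real full-text index, such as the Lucene.NET index mentioned in the question, would replace it.

```python
import sqlite3

# Schema from the answer; ModifiedTime is a hypothetical addition.
SCHEMA = """
CREATE TABLE IF NOT EXISTS PDFFiles (
    PDFFileID    INTEGER PRIMARY KEY,
    PDFFilePath  TEXT UNIQUE NOT NULL,
    PDFTitle     TEXT,
    PDFAuthor    TEXT,
    PDFKeywords  TEXT,
    PDFFullText  TEXT,
    ModifiedTime REAL
)
"""


def open_catalogue(db_path=":memory:"):
    """Open (or create) the catalogue database."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def needs_conversion(conn, path, modified_time):
    """True if the file is new or has changed since it was stored."""
    row = conn.execute(
        "SELECT ModifiedTime FROM PDFFiles WHERE PDFFilePath = ?", (path,)
    ).fetchone()
    return row is None or row[0] < modified_time


def store_pdf(conn, path, modified_time, full_text,
              title=None, author=None, keywords=None):
    """Insert the extracted text, replacing any earlier row for this path."""
    conn.execute(
        "INSERT OR REPLACE INTO PDFFiles "
        "(PDFFilePath, PDFTitle, PDFAuthor, PDFKeywords, PDFFullText, ModifiedTime) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (path, title, author, keywords, full_text, modified_time),
    )
    conn.commit()


def search(conn, term):
    """Return paths whose text contains the term.

    Uses a plain substring LIKE scan, so % and _ in the term act
    as wildcards and no index is involved.
    """
    rows = conn.execute(
        "SELECT PDFFilePath FROM PDFFiles "
        "WHERE PDFFullText LIKE ? ORDER BY PDFFilePath",
        ("%" + term + "%",),
    )
    return [row[0] for row in rows]
```

In use, a scan of the collection would call `needs_conversion` with each file's modification time and run the PDF->text step only when it returns true, which is what keeps re-indexing cheap after the initial bulk conversion.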
