Which module parses .pdf files efficiently in one pass: CAM::PDF or PDF::API2?
I want to extract all the keywords from a huge PDF file (50 MB). Which module is good for parsing large PDF files?
I'm concerned about memory use when parsing a huge file and extracting almost all of its keywords.
I want SAX-style parsing (a single streaming pass), not DOM-style parsing, by analogy with XML.
Comments (1)
To read text out of a PDF, we use CAM::PDF, and it worked just fine. It wasn't hugely fast on some larger files, but its ability to handle large files was not bad. We certainly had a few that were ~100 MB and were handled OK. If I recall, we struggled with a few that were 130 MB on a 32-bit (Windows) Perl, but we had a whole lot of other stuff in memory at the time.

We did look at PDF::API2, but it seemed more oriented to generating PDFs than to reading from them. We didn't throw large files into PDF::API2, so I can't give a real benchmark figure.

The only significant downside we found with using CAM::PDF is that PDF 1.6 is becoming more common, and that doesn't work at all in CAM::PDF yet. That might not be an issue for you, but it might be something to consider.

In answer to your question, I'm pretty sure both modules read the whole source PDF into memory in one form or another, but I don't think CAM::PDF builds as many complex structures out of it. So neither is really SAX-like, but CAM::PDF seemed to be lighter in general, and it can retrieve one page at a time, so it might reduce the load when extracting very large texts.
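
As a rough illustration of the page-at-a-time approach, here is a minimal sketch that keeps only one page's text in memory while tallying words. The helper names `count_keywords` and `keywords_from_pdf`, the four-letter minimum, and the ASCII-only word pattern are my own assumptions for the example. The CAM::PDF calls used are `new`, `numPages` and `getPageText`. Note that CAM::PDF still loads the PDF itself into memory, so this only limits the size of the extracted text, not of the parsed file.

```perl
use strict;
use warnings;

# Tally words page by page. $get_page_text is any callback that returns
# the text of a 1-based page number (hypothetical helper, not part of CAM::PDF).
sub count_keywords {
    my ($num_pages, $get_page_text, $min_len) = @_;
    $min_len //= 4;    # assumption: ignore very short words
    my %count;
    for my $n (1 .. $num_pages) {
        my $text = $get_page_text->($n);
        next unless defined $text;
        # Assumption: "keywords" are plain ASCII words; adjust the pattern as needed.
        for my $word ($text =~ /([A-Za-z]+)/g) {
            next if length($word) < $min_len;
            $count{ lc $word }++;
        }
        # $text goes out of scope here, so only one page of text is held at a time.
    }
    return \%count;
}

# Wire the counter to CAM::PDF (loaded lazily, so count_keywords works without it).
sub keywords_from_pdf {
    my ($file) = @_;
    require CAM::PDF;
    my $pdf = CAM::PDF->new($file)
        or die "Cannot open $file: $CAM::PDF::errstr\n";
    return count_keywords($pdf->numPages(), sub { $pdf->getPageText($_[0]) });
}
```

Calling `keywords_from_pdf('big.pdf')` (the file name is a placeholder) would return a hash reference mapping each lower-cased word to its count. Passing the page reader as a callback keeps the counting logic independent of the PDF module, so the same loop could be tried against PDF::API2 or another reader.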