OCR library that can insert the OCR text into the source PDF
Is there a library (or executable) that can OCR a PDF (typically one created by scanning paper) and inject the recognized text back into the PDF, probably as invisible text behind the scanned images?
Preferably open source.
(Goal: I have a huge library of PDF files indexed by Lucene. It would be much easier for Lucene to find which PDFs are relevant if the PDFs contained text.)
Comments (2)
One of the best options is probably ABBYY FineReader (www.abbyy.com), as it gives you lots of options, including the creation of hidden text. I had a quick look at their site and also came across their Transformer product, which is probably even more suitable for your needs.
http://www.abbyy.com.au/pdftransformer/product_features/
If the PDFs don't contain text, what is indexed by Lucene?
Take a look at Docsplit (https://github.com/documentcloud/docsplit); it can use Tesseract to perform OCR. You will get plain text files that reflect the content of the PDFs. You can then build your Lucene index on top of these text files and store a reference to each PDF in the index. After querying the Lucene index, you will get a list of documents with references to the original PDFs.
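To make the "index the text, store a reference to the PDF" idea concrete, here is a minimal sketch in Python. It is not Lucene and not Docsplit's API: it is a toy inverted index standing in for Lucene, and it assumes (hypothetically) that the OCR step has already written one `name.txt` file per `name.pdf`. The function names `build_index` and `search` and the directory layout are illustrative only.

```python
import re
from collections import defaultdict
from pathlib import Path


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(text_dir, pdf_dir):
    """Map each token to the set of PDF paths whose OCR text contains it.

    Assumption (hypothetical layout): the OCR output for
    <pdf_dir>/name.pdf is stored as <text_dir>/name.txt.
    """
    index = defaultdict(set)
    for txt_path in sorted(Path(text_dir).glob("*.txt")):
        # Store a reference to the original PDF, not to the text file.
        pdf_path = Path(pdf_dir) / (txt_path.stem + ".pdf")
        text = txt_path.read_text(encoding="utf-8", errors="replace")
        for token in tokenize(text):
            index[token].add(str(pdf_path))
    return index


def search(index, query):
    """Return the sorted PDF paths that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        hits &= index.get(token, set())
    return sorted(hits)
```

With Lucene itself the shape would be the same: the OCR text goes into an indexed field and the PDF path into a stored field, so each search hit carries the reference back to the original file.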