How to identify PDF files that need OCR?
I have over 30,000 PDF files. Some have already been OCR'd and some have not. Is there a way to find out which files are already OCR'd and which PDFs are image-only?
It will take forever if I run every single file through an OCR processor.
I would write a small script to extract the text from the PDF files and see if it is "empty". If there is text, the PDF has already been OCRed. You could use either Ghostscript or XPDF to extract the text.
EDIT:
This should get you started:
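The snippet that originally followed is not shown here, so what follows is only a minimal Python sketch of the idea, not the answerer's code. It assumes pdftotext (from Xpdf or Poppler) is on the PATH and uses "-" as the output file to print to stdout. The names extract_text and needs_ocr and the 20-character threshold are illustrative assumptions. It counts letters and digits rather than testing for an empty string, because extraction from image-only pages can still yield whitespace or page-break characters.

```python
import subprocess
import sys


def extract_text(pdf_path, timeout=60):
    """Run pdftotext and return whatever text it finds ("-" means stdout)."""
    result = subprocess.run(
        ["pdftotext", "-q", str(pdf_path), "-"],
        capture_output=True,
        timeout=timeout,
        check=False,
    )
    return result.stdout.decode("utf-8", errors="replace")


def needs_ocr(text, min_chars=20):
    """Assumed heuristic: too few letters or digits means no usable text layer."""
    meaningful = sum(1 for ch in text if ch.isalnum())
    return meaningful < min_chars


if __name__ == "__main__":
    # Usage: python check_ocr.py file1.pdf file2.pdf ...
    for path in sys.argv[1:]:
        label = "needs OCR" if needs_ocr(extract_text(path)) else "has text"
        print(f"{path}: {label}")
```

The threshold probably needs tuning for a given collection; scanned documents with a short typed cover sheet, for example, would pass this check even though most pages are images.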
Unfortunately, even when you have only images in your PDF, pdftotext will extract some text, so you will have to do some more work to check whether you need to OCR the PDF.
XPDF worked for me in a different way, but I am not sure it is the right way.
My image-only PDFs also returned some text content, so I used pdffonts.exe to check whether fonts are embedded in the document. In my case, all image files showed 'no' for the embedded value, whereas all searchable PDFs showed 'yes'.
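The font check can be sketched in Python as follows. This is a hedged illustration rather than the answerer's procedure: it assumes pdffonts is on the PATH and prints a header row containing an "emb" column, then a line of dashes, then one row per font, with every column after "emb" filled in. The function names are hypothetical, and the exact column set differs between Xpdf and Poppler builds, so the parser locates "emb" by counting from the right-hand end of the header.

```python
import subprocess
import sys


def embedded_flags(pdffonts_output):
    """Return the 'emb' value (e.g. 'yes'/'no') for each font row in the listing."""
    lines = pdffonts_output.splitlines()
    sep = next((i for i, line in enumerate(lines) if line.startswith("---")), None)
    if sep is None or sep == 0:
        return []
    header = lines[sep - 1].split()
    if "emb" not in header:
        return []
    # Font names and types may contain spaces, so index from the right instead.
    from_end = len(header) - header.index("emb")
    flags = []
    for row in lines[sep + 1:]:
        tokens = row.split()
        if len(tokens) >= from_end:
            flags.append(tokens[-from_end])
    return flags


def looks_searchable(pdffonts_output):
    """Mirror the observation above: at least one embedded font => searchable."""
    return any(flag == "yes" for flag in embedded_flags(pdffonts_output))


if __name__ == "__main__":
    for path in sys.argv[1:]:
        listing = subprocess.run(
            ["pdffonts", path], capture_output=True, check=False
        ).stdout.decode("utf-8", errors="replace")
        print(f"{path}: {'searchable' if looks_searchable(listing) else 'image only?'}")
```

A searchable PDF that relies only on non-embedded standard fonts would be misclassified by this rule, so it is worth spot-checking a sample before trusting it on the whole collection.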
I found that TotalCmd has a plugin that handles this:
https://totalcmd.net/plugring/pdfOCR.html
The following script will recursively find files that need OCR. You'll need to get pdftotext from your favorite source; I used Cygwin to install it. I used the following script to move files that need OCR into a subfolder so that I can perform batch OCR from Acrobat. You could instead run OCR directly with the command-line tool of your choice.
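The recursive find-and-move step described above might look like the Python sketch below. It is not the answerer's script; it assumes pdftotext is on the PATH, and the subfolder name "needs_ocr", the 20-character threshold, and the function names are all illustrative choices. The text check is passed in as a parameter so the file-moving logic can be tried with a stand-in before running it on real data.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def pdf_has_text(pdf_path, min_chars=20):
    """Assumed check: pdftotext finds at least min_chars letters or digits."""
    out = subprocess.run(
        ["pdftotext", "-q", str(pdf_path), "-"], capture_output=True, check=False
    ).stdout.decode("utf-8", errors="replace")
    return sum(ch.isalnum() for ch in out) >= min_chars


def move_image_only_pdfs(root, has_text=pdf_has_text, subfolder="needs_ocr"):
    """Move PDFs without a text layer into <their folder>/<subfolder>."""
    moved = []
    # sorted() materialises the listing first, so moving files cannot disturb the walk.
    for pdf in sorted(Path(root).rglob("*.pdf")):
        if pdf.parent.name == subfolder:
            continue  # already set aside by an earlier run
        if has_text(pdf):
            continue
        target_dir = pdf.parent / subfolder
        target_dir.mkdir(exist_ok=True)
        target = target_dir / pdf.name
        shutil.move(str(pdf), str(target))
        moved.append(target)
    return moved


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python sort_pdfs.py /path/to/pdf/root
    for target in move_image_only_pdfs(sys.argv[1]):
        print(target)
```

Note that the "*.pdf" pattern is case-sensitive on some platforms, so files ending in ".PDF" may need a second pass.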
You can scan a folder or an entire drive using the desktop search tool "dtSearch". At the end of the scan, it will show a list of all "image only" PDFs. In addition, it will also show a list of "encrypted" PDFs, if any.