Batch OCR for PDFs that have not yet been OCRed
If I have 10,000 PDFs, some of which have been OCRed, some of which have 1 page that has been OCRed but the rest of the pages have not, how can I go through all the PDFs and only OCR the pages that haven't already been done?
Comments (4)
This is exactly what I was looking for. I have thousands of scanned PDF files, some of which were already OCRed and some of which were not.
So I combined information I found on forums and Stack Overflow and made my own solution that does EXACTLY that, which I have summarized for you here:
I am on Windows 10, and could not find a definitive answer. I tried doing this with Acrobat Pro, but that gave me many errors, and Acrobat's batch processing stops on every error or password-protected file. I also tried many other batch OCR tools on Windows, but none worked well.
I spent countless hours manually checking which files already had a text layer "under" the image.
UNTIL! Microsoft announced that it was now very easy to run Linux under Windows, on the same machine, on the same filesystem.
There are many more tools and utilities available on Linux than on Windows, so I thought I would give that a try.
So, here it is, step by step:
In the Linux shell, change (cd) to the folder that holds your PDFs, for example:
/mnt/c/Users/name/OneDrive/Documents
Then run:
find . -type f -name "*.pdf" -exec /your/homedir/pdf-ocr.sh '{}' \;
Done!
Running this might, of course, take a long time, depending on how many PDFs you have and how many of those are not OCRed yet.
Here is the sh script. You should save it somewhere in your home folder so that it is easy to call from anywhere. Like so:
cd ~
This will bring you to your home folder.
pico pdf-ocr.sh
This will bring up an editor. Paste the script code below. Then press Ctrl+X, and press Y. Your file is now saved.
sudo chmod +x pdf-ocr.sh
This will give the script permission to be run.
What does this do?
Well, the find command looks up all PDF files in the current directory, including subdirectories. It then "sends" each file to the script, in which pdffonts checks whether there are embedded fonts. If there are, the script skips the file and moves on to the next one. If no embedded fonts are found, it uses ocrmypdf to do the OCR.
I found the quality of OCR from ocrmypdf VERY good, even better than Acrobat's. You can of course tweak the settings. I can imagine, for example, that you might want to use other languages for OCR than eng+deu+nld. You can look up all options here: https://ocrmypdf.readthedocs.io/en/latest/
Note: I am making the assumption here that if a PDF file has no embedded fonts (so it's basically an image (scan) in a PDF file), it has not been OCRed. I know that this might not always be accurate, but for me that is enough to determine which files to put through OCR, so that it is not necessary to redo hundreds or thousands of PDF files.
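A script matching the description above could look roughly like this. It is a minimal sketch, not necessarily the exact script used here: it assumes pdffonts (from poppler-utils) and ocrmypdf are installed, and the function names list_fonts and ocr_if_needed are made up for illustration.

```shell
#!/usr/bin/env bash
# pdf-ocr.sh (sketch): OCR one PDF in place unless it already embeds fonts.
# Assumes pdffonts (poppler-utils) and ocrmypdf are installed and on PATH.

# Print the font rows pdffonts reports for a file, without the two header lines.
list_fonts() {
    pdffonts "$1" 2>/dev/null | tail -n +3
}

# OCR the file only when pdffonts reports no fonts at all.
ocr_if_needed() {
    local pdf=$1
    if [ -n "$(list_fonts "$pdf")" ]; then
        echo "Already has fonts, skipping: $pdf"
        return 0
    fi
    echo "No fonts found, running OCR: $pdf"
    # Same path for input and output: ocrmypdf replaces the file
    # only if it finishes successfully.
    ocrmypdf -l eng+deu+nld "$pdf" "$pdf"
}

# find passes one file name per call as the first argument.
if [ "$#" -gt 0 ]; then
    ocr_if_needed "$1"
fi
```

Because the check is per file, a PDF in which only one page has a text layer still counts as "has fonts" and is skipped. The ocrmypdf documentation describes a --skip-text option for OCRing only the pages without text; check the linked documentation for the version you have installed. A file that pdffonts cannot read (for example a password-protected one) also shows no fonts, so ocrmypdf will be tried on it and will most likely report an error for that file.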
I know that it is a bit more hassle to install Linux under Windows, but it is very easy to do if you have basic Linux skills. For me it was worth the effort because I now have a "one click" batch processor that works. I could not find a solution for that with Windows tools.
I hope someone finds this and finds it useful. If anyone has improvements, please post them here.
Thanks.
Jos Jonkeren
Why don't you re-OCR everything? The amount of time you spend agonizing over repeated work probably exceeds the time taken for the work itself.
If by OCRed you mean that they contain the text in machine-readable form, you could use a library like Apache PDFBox to try to extract the text from the second page of the document. If it throws an error or returns garbage, it's most likely not OCRed.
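The same idea can be tried from the shell without writing any Java. The sketch below swaps PDFBox for pdftotext from poppler-utils (an assumption, not what this comment used) and simply counts how many letters and digits come out of page 2. The threshold MIN_CHARS and the function names are hypothetical, and the check does not try to detect garbage text.

```shell
#!/usr/bin/env bash
# Sketch: treat a PDF as "probably OCRed" when page 2 yields some text.
# Assumes pdftotext (poppler-utils) is installed; it stands in for PDFBox here.

# Count the letters and digits pdftotext extracts from one page.
page_char_count() {
    local pdf=$1 page=$2
    pdftotext -f "$page" -l "$page" "$pdf" - 2>/dev/null | tr -cd '[:alnum:]' | wc -c
}

# Hypothetical threshold, not a value from the thread.
MIN_CHARS=${MIN_CHARS:-20}

# Succeeds when page 2 has at least MIN_CHARS extracted characters.
looks_ocred() {
    local count
    count=$(( $(page_char_count "$1" 2) ))
    [ "$count" -ge "$MIN_CHARS" ]
}

if [ "$#" -gt 0 ]; then
    if looks_ocred "$1"; then
        echo "probably OCRed: $1"
    else
        echo "probably not OCRed: $1"
    fi
fi
```

A single-page PDF has no page 2, so it will always be reported as not OCRed; for such files the page number would need to be 1.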
Unburying this thread.
You can know which PDF files have already been OCRed by testing them with pdffonts. If there are embedded fonts, it's very probable that the PDF is already OCRed.
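For files where only some pages were OCRed, the pdffonts test can also be run one page at a time. This is a rough sketch under the assumption that pdfinfo and pdffonts from poppler-utils are installed; the function names are made up for illustration, and "no fonts on the page" is only a heuristic for "no text layer".

```shell
#!/usr/bin/env bash
# Sketch: list the pages of a PDF for which pdffonts reports no fonts.
# Assumes pdfinfo and pdffonts (both from poppler-utils) are installed.

# Print the page count taken from the "Pages:" line of pdfinfo.
page_count() {
    pdfinfo "$1" 2>/dev/null | awk '/^Pages:/ {print $2}'
}

# Print one page number per line for every page without fonts.
pages_without_fonts() {
    local pdf=$1 total page
    total=$(page_count "$pdf")
    [ -n "$total" ] || return 1   # pdfinfo could not read the file
    for ((page = 1; page <= total; page++)); do
        # -f/-l restrict pdffonts to a single page; font rows start at line 3.
        if [ -z "$(pdffonts -f "$page" -l "$page" "$pdf" 2>/dev/null | tail -n +3)" ]; then
            echo "$page"
        fi
    done
}

if [ "$#" -gt 0 ]; then
    pages_without_fonts "$1"
fi
```

An empty result means every page reported at least one font. Running pdffonts once per page is slow, so on a large collection it may be worth doing the whole-file check first and the per-page check only on files that pass it.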
As for the batch processing, I wrote a little script that can batch OCR to pdf/word/excel/csv output format.
You may find it at https://github.com/deajan/pmOCR
pmOCR (poor man's OCR) is a wrapper for the Abbyy OCR CLI for Linux or the open source Tesseract 3.