Batch OCR only the PDFs that have not been OCR'ed yet

Posted on 2024-08-08 02:56:24

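If I have 10,000 PDFs, some of which have already been OCR'ed and some of which have only one page OCR'ed while the remaining pages have not, how can I go through all of them and OCR only the pages that have not been OCR'ed yet?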


嗫嚅 2024-08-15 02:56:24

This is exactly what I was looking for. I have thousands of scanned PDF files, some of which were already OCR'ed and some not.

So I combined information I found on forums and Stack Overflow into my own solution that does EXACTLY that. In summary, it will:

scan through all subdirectories recursively for PDF files;
check whether each PDF was already OCR'ed and, if not, OCR it at high quality in the language(s) you specify;
save the OCR'ed PDF in place as PDF/A, overwriting the old (non-OCR'ed) one.

I am on Windows 10 and could not find a definitive answer. I tried doing this with Acrobat Pro, but that gave me many errors, and Acrobat's batch processing stops on every error or password-protected file. I also tried many other batch OCR tools on Windows, but none worked well.
I spent countless hours manually checking which files already had a text-layer "under" the image.

UNTIL! Microsoft announced that it was now very easy to run Linux under Windows, on the same machine, on the same filesystem.
There are many more tools and utilities available on Linux than Windows, so I thought I would give that a try.

So, here it is, step by step:

Enable the Windows Subsystem for Linux (WSL) in the Windows Control Panel; there are many guides, so Google it. It takes a couple of minutes.
Install Linux from the Windows Store: open the Store, search for Ubuntu, and install it. This takes around 5 minutes.
Now you have the "Ubuntu app". Run it. It gives you a Linux bash shell with access to your Windows files through /mnt/c. It's magic!
You need two Linux tools, pdffonts and ocrmypdf. On Ubuntu, pdffonts is part of the poppler-utils package, so install them with sudo apt install poppler-utils and sudo apt install ocrmypdf. We will use these tools to check whether a PDF contains an embedded font and, if not, OCR it (see the note below).
Save the very small bash script (below) in your home directory ~.
Go to (cd) the directory where all your PDFs are saved, for example /mnt/c/Users/name/OneDrive/Documents.
Run the command: find . -type f -name "*.pdf" -exec /your/homedir/pdf-ocr.sh '{}' \;
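
Before launching the full run, it can help to preview which files find will pick up. The following is a minimal sketch, assuming GNU find (the default on Ubuntu). The function name list_pdfs is made up for illustration, and it uses -iname instead of -name so that files ending in .PDF are matched too:

```shell
# list_pdfs DIR: print every PDF under DIR, one path per line (hypothetical helper).
# -iname matches the extension case-insensitively, so scan.PDF is included.
list_pdfs() {
    find "$1" -type f -iname '*.pdf' | sort
}

# Example: count the PDFs below the current directory before starting.
# list_pdfs . | wc -l
```

If the count looks wrong, it is cheaper to find out here than halfway through a long OCR run.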

Done!

Running this might, of course, take a long time, depending on how many PDFs you have and how many of those are not OCR'ed yet.

Here is the shell script. You should save it somewhere in your home folder so that it is easy to call from anywhere. Like so:

Type cd ~. This takes you to your home folder.
Type pico pdf-ocr.sh. This opens an editor. Paste the script code below, then press Ctrl+X, press Y, and press Enter to confirm the file name. Your file is now saved.
Type chmod +x pdf-ocr.sh. This makes the script executable; sudo is not needed for a file you own.

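#!/bin/bash
# Usage: pdf-ocr.sh FILE.pdf
# List the font names found in the first 5 pages; tail drops the two header lines of the pdffonts output.
MYFONTS=$(pdffonts -l 5 "$1" | tail -n +3 | cut -d' ' -f1 | sort | uniq)
# No fonts at all, or only unnamed '[none]' entries, is taken to mean "no text layer yet".
if [ "$MYFONTS" = '' ] || [ "$MYFONTS" = '[none]' ]; then
    echo "Not yet OCR'ed: $1 -------- Processing...."
    echo " "
    # -l sets the OCR languages; -s skips pages that already contain text; the output overwrites the input.
    ocrmypdf -l eng+deu+nld -s "$1" "$1"
    echo " "
else
    echo "Already OCR'ed: $1"
    echo " "
fi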
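
The font check at the top of the script can also be pulled out into a small function, which makes it easy to try on saved pdffonts output without touching any PDF. This is a sketch under the assumption that pdffonts prints two header lines before the font table (as the tail -n +3 in the script implies); the name fonts_missing is hypothetical:

```shell
# fonts_missing: read `pdffonts` output on stdin; succeed (exit 0) when it
# lists no fonts, or only unnamed '[none]' entries.
fonts_missing() {
    names=$(tail -n +3 | cut -d' ' -f1 | sort -u)
    [ -z "$names" ] || [ "$names" = '[none]' ]
}

# Usage with a real file (requires pdffonts from poppler-utils):
# pdffonts -l 5 "file.pdf" | fonts_missing && echo "needs OCR"
```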

What does this do?

Well, the find command looks up all PDF files in the current directory, including subdirectories. It then "sends" each file to the script, in which pdffonts checks the first five pages (-l 5) for embedded fonts. If any are found, the file is skipped and find moves on to the next one. If none are found, ocrmypdf does the OCR; its -s option skips any pages that already contain text.
I found the quality of OCR from ocrmypdf VERY good, even better than Acrobat's. You can of course tweak the settings; for example, you might want to use languages other than eng+deu+nld. You can look up all options here: https://ocrmypdf.readthedocs.io/en/latest/
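
As one way to make the language choice adjustable without editing the script, the ocrmypdf call could read it from an environment variable. This is only a sketch: OCR_LANGS and ocr_in_place are names invented here, and the default mirrors the eng+deu+nld used in the script. Each language also needs its Tesseract data installed (on Ubuntu these are packages such as tesseract-ocr-deu).

```shell
# ocr_in_place FILE: OCR FILE and overwrite it (hypothetical helper).
# OCR_LANGS is an assumed environment variable; it defaults to the languages
# used in the original script. --skip-text is the long form of -s.
ocr_in_place() {
    ocrmypdf -l "${OCR_LANGS:-eng+deu+nld}" --skip-text "$1" "$1"
}

# Example: OCR_LANGS=eng+fra ocr_in_place "scan.pdf"
```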

Note: I am assuming here that a PDF file with no embedded fonts (so basically an image (scan) in a PDF file) has not been OCR'ed. I know this might not always be accurate, but for me it is enough to determine which files to put through OCR, so that it is not necessary to redo hundreds or thousands of PDF files.
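
If the font heuristic turns out to be too coarse for your files, one possible alternative is to test whether any text can actually be extracted. The sketch below assumes pdftotext, which ships in the same poppler-utils package as pdffonts; the helper name has_text is hypothetical. It only shows that some text exists on the checked pages and says nothing about how good that text is.

```shell
# has_text: read extracted text on stdin; succeed when it contains at least
# one non-whitespace character (hypothetical helper).
has_text() {
    grep -q '[^[:space:]]'
}

# Usage with a real file (requires pdftotext from poppler-utils):
# pdftotext -l 5 "file.pdf" - | has_text && echo "already has a text layer"
```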

I know that it is a bit more hassle to install Linux under Windows, but it is very easy to do if you have basic Linux skills. For me it was worth the effort, because I now have a "one click" batch processor that works. I could not find a solution for that with Windows tools.

I hope someone finds this useful. If anyone has improvements, please post them here.

Thanks.

Jos Jonkeren
