Batch OCR for PDFs that haven't already been OCR'ed

Posted 2024-08-08 02:56:24

If I have 10,000 PDFs, some of which have been OCRed, some of which have 1 page that has been OCRed but the rest of the pages have not, how can I go through all the PDFs and only OCR the pages that haven't already been done?

Comments (4)

嗫嚅 2024-08-15 02:56:24

This is exactly what I was looking for. I have thousands of scanned PDF files, some already OCR'ed and some not.

So I combined information I found on forums and Stack Overflow into my own solution that does EXACTLY that. In summary, it will:

  • scan through all subdirectories recursively for PDF files;
  • check whether each PDF was already OCR'ed and, if not, OCR it at high quality in the language(s) you specify;
  • save the OCR'ed PDF in place, as PDF/A, overwriting the old (not-OCR'ed) one.

I am on Windows 10, and could not find the definitive answer. I tried doing this with Acrobat Pro, but that gave me many errors, and Acrobat's batch processing stops on every error or password-protected file. I also tried many other batch-OCR tools on Windows, but none worked well.
I spent countless hours manually checking which files already had a text-layer "under" the image.

UNTIL! Microsoft announced that it was now very easy to run Linux under Windows, on the same machine, on the same filesystem.
There are many more tools and utilities available on Linux than Windows, so I thought I would give that a try.

So, here it is, step by step:

  1. Enable the Windows Subsystem for Linux in the Windows Control Panel; there are many guides. Google it. It takes a couple of minutes.
  2. Install Linux from the Windows Store. Open the Windows Store, search for Ubuntu, and install it. This takes around 5 minutes.
  3. Now you have the "Ubuntu app". Run it. It gives you a Linux bash shell, with access to your Windows files through /mnt/c. It's magic!
  4. You need two Linux command-line tools, pdffonts and ocrmypdf. On Ubuntu, pdffonts is part of the poppler-utils package, so install both with the command sudo apt install poppler-utils ocrmypdf. We will use these tools to check whether a PDF contains any fonts and, if not, OCR the PDF (see the note below).
  5. Install the very small shell script (below) in your home directory ~.
  6. Go to (cd) the directory where all your PDFs are saved. For example: /mnt/c/Users/name/OneDrive/Documents.
  7. Run the command: find . -type f -name "*.pdf" -exec /your/homedir/pdf-ocr.sh '{}' \;

Done!

Running this might, of course, take a long time, depending on how many PDFs you have and how many of those are not OCR'ed yet.
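
To get a rough idea of how big the job is before starting, it can help to count the PDFs first. This is only a minimal sketch: count_pdfs is a name made up for this example, and -iname (which also matches .PDF) is supported by GNU and BSD find but is not part of POSIX.

```shell
#!/bin/sh
# Sketch: count the PDFs under a directory before starting a long batch run.
# count_pdfs is a hypothetical helper; -iname is a GNU/BSD find extension.

# count_pdfs [DIR] -> prints the number of PDF files under DIR (default: .).
# File names containing newlines would be counted more than once.
count_pdfs() {
    find "${1:-.}" -type f -iname '*.pdf' | wc -l
}
```

For example, count_pdfs /mnt/c/Users/name/OneDrive/Documents prints a single number. Note that the find command in step 7 uses -name, which is case-sensitive, so the two counts can differ if some files end in .PDF.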

Here is the shell script. Save it in your home folder so that it is easy to call from anywhere, like so:

  1. Type cd ~. This takes you to your home folder.
  2. Type pico pdf-ocr.sh. This opens an editor. Paste the script code below, then press Ctrl+X, then Y, then Enter. Your file is now saved.
  3. Type chmod +x pdf-ocr.sh. This gives the script permission to run (sudo is not needed for a file in your own home folder).
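
#!/bin/sh
# Usage: pdf-ocr.sh FILE.pdf
# List the fonts used on the first 5 pages, skipping pdffonts' two header lines.
MYFONTS=$(pdffonts -l 5 "$1" | tail -n +3 | cut -d' ' -f1 | sort | uniq)
if [ "$MYFONTS" = '' ] || [ "$MYFONTS" = '[none]' ]; then
    echo "Not yet OCR'ed: $1 -------- Processing...."
    echo " "
    # -s (--skip-text) leaves pages that already contain text untouched.
    ocrmypdf -l eng+deu+nld -s "$1" "$1"
    echo " "
else
    echo "Already OCR'ed: $1"
    echo " "
fi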
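
If you would rather not have the OCR step write straight over your only copy of a file, one option is to OCR into a temporary file and replace the original only when ocrmypdf exits successfully. The sketch below is an assumption-laden variant, not part of the script above: ocr_safely, OCR_LANGS and OCR_FAIL_LOG are names invented for this example, and it assumes ocrmypdf and mktemp are available.

```shell
#!/bin/sh
# Sketch: OCR into a temporary file and replace the original only on success.
# ocr_safely, OCR_LANGS and OCR_FAIL_LOG are hypothetical names for this example.

OCR_LANGS=${OCR_LANGS:-eng+deu+nld}
OCR_FAIL_LOG=${OCR_FAIL_LOG:-$HOME/pdf-ocr-failed.log}

# ocr_safely FILE -> exit 0 if FILE was replaced by an OCR'ed copy, non-zero otherwise.
ocr_safely() {
    tmpdir=$(mktemp -d) || return 1
    if ocrmypdf -l "$OCR_LANGS" -s "$1" "$tmpdir/out.pdf"; then
        # OCR succeeded: move the result over the original.
        mv -f "$tmpdir/out.pdf" "$1"
        status=$?
    else
        # OCR failed (e.g. a password-protected file): keep the original, note the name.
        echo "$1" >> "$OCR_FAIL_LOG"
        status=1
    fi
    rm -rf "$tmpdir"
    return "$status"
}
```

Files that could not be processed end up listed in the log, so they can be looked at by hand after the batch has finished.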

What does this do?

Well, the find command looks up all PDF files in the current directory, including subdirectories. It then passes each file to the script, in which pdffonts lists the fonts used on the first five pages. If any fonts are found, the script skips the file and find moves on to the next one. If no fonts are found, the script runs ocrmypdf to do the OCR; the -s (--skip-text) option tells ocrmypdf to leave pages that already contain text alone.
I found the quality of OCR from ocrmypdf VERY good, even better than Acrobat's. You can of course tweak the settings. For example, you might want to OCR in languages other than eng+deu+nld; each language needs its Tesseract language data installed (on Ubuntu, packages such as tesseract-ocr-deu). You can look up all options here: https://ocrmypdf.readthedocs.io/en/latest/

Note: I am assuming here that if a PDF file uses no fonts (so it's basically an image (scan) in a PDF file), it has not been OCR'ed. I know that this might not always be accurate, but for me it is enough to decide which files to put through OCR, so that it is not necessary to re-do hundreds or thousands of PDF files. Be aware that a file with fonts on any of its first five pages is skipped entirely, so a PDF in which only some pages were OCR'ed will not be picked up.
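
Since the heuristic can be wrong, it may be worth doing a dry run that only reports what would happen, without running any OCR. A minimal sketch, assuming pdffonts from poppler-utils is installed; needs_ocr and classify_pdfs are names made up for this example:

```shell
#!/bin/sh
# Sketch: report which files the font heuristic would send to OCR, without running OCR.
# needs_ocr and classify_pdfs are hypothetical helpers; pdffonts must be on PATH.

# needs_ocr FILE -> exit 0 if pdffonts lists no fonts on the first 5 pages.
# A file pdffonts cannot read also yields no fonts, so it is reported as "OCR".
needs_ocr() {
    fonts=$(pdffonts -l 5 "$1" 2>/dev/null | tail -n +3 | cut -d' ' -f1 | sort -u)
    [ -z "$fonts" ] || [ "$fonts" = '[none]' ]
}

# classify_pdfs FILE... -> prints "OCR  FILE" or "SKIP FILE" for each argument.
classify_pdfs() {
    for f in "$@"; do
        if needs_ocr "$f"; then
            echo "OCR  $f"
        else
            echo "SKIP $f"
        fi
    done
}
```

Calling classify_pdfs *.pdf in a folder gives a list you can spot-check against a few files you know, before committing to the real run.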

I know that installing Linux under Windows is a bit more hassle, but it is very easy to do if you have basic Linux skills. For me it was worth the effort because I now have a "one-click" batch processor that works. I could not find a solution for that with Windows tools.

I hope someone finds this useful. If anyone has improvements, please post them here.

Thanks.

Jos Jonkeren

兔姬 2024-08-15 02:56:24

Why don't you re-OCR everything? The amount of time you spend agonizing over repeated work probably exceeds the time taken for the work itself.

錯遇了你 2024-08-15 02:56:24

If by OCRed you mean that they contain the text in machine-readable form, you could use a library like Apache PDFBox to try to extract the text from the second page of the document. If it throws an error or returns garbage, it's most likely not OCRed.
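
The same idea can be sketched on the command line without PDFBox, by swapping in pdftotext and pdfinfo from poppler-utils (this is a substitute for illustration, not what the answer above used). The helper names page_has_text and pages_without_text are made up, and "no non-blank characters" is only an assumed test for "not OCRed"; it will not catch pages that return garbage.

```shell
#!/bin/sh
# Sketch: list the pages of a PDF that yield no extractable text.
# Assumes pdfinfo and pdftotext (poppler-utils) are on PATH; helper names are hypothetical.

# page_has_text FILE PAGE -> exit 0 if the page yields any non-blank characters.
page_has_text() {
    chars=$(pdftotext -f "$2" -l "$2" "$1" - 2>/dev/null | tr -d '[:space:]' | wc -c)
    [ "$chars" -gt 0 ]
}

# pages_without_text FILE -> prints the number of each page without text, one per line.
pages_without_text() {
    total=$(pdfinfo "$1" 2>/dev/null | awk '/^Pages:/ {print $2}')
    page=1
    while [ "$page" -le "${total:-0}" ]; do
        page_has_text "$1" "$page" || echo "$page"
        page=$((page + 1))
    done
}
```

A file for which pages_without_text prints nothing has text on every page; a file where it prints some page numbers is one of the partially OCRed ones from the question.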

晨曦慕雪 2024-08-15 02:56:24

Unburying this thread.

You can tell which PDF files have already been OCRed by testing them with pdffonts. If there are embedded fonts, it's very probable that the PDF is already OCRed.

As for the batch processing, I wrote a little script that can batch OCR to pdf/word/excel/csv output format.

You can find it at https://github.com/deajan/pmOCR
pmOCR (poor man's OCR) is a wrapper for the Abbyy OCR CLI for Linux or for the open-source Tesseract 3.
