How can I identify PDF files that need OCR?

Posted on 2024-12-09 05:04:07

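I have more than 30,000 PDF files. Some have already been OCRed and some have not. Is there a way to find out which files have already been OCRed and which are image-only PDFs?

Running every file through an OCR processor would take far too long.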

我一直都在从未离去 2024-12-16 05:04:07

I would write a small script that extracts the text from each PDF file and checks whether it is empty. If there is text, the PDF has already been OCRed. You could use either Ghostscript or XPDF to extract the text.

EDIT:
This should get you started:

# Print how much text pdftotext extracts from each PDF in the current folder.
foreach ($pdffile in Get-ChildItem -Filter *.pdf) {
    # '-' sends the text to stdout; the call operator (&) avoids the quoting problems of Invoke-Expression.
    $pdftext = (& "\path\to\xpdf\pdftotext.exe" $pdffile.FullName '-') -join "`n"
    Write-Host $pdffile.FullName
    Write-Host $pdftext.Length
    Write-Host $pdftext
    Write-Host "-------------------------------"
}

Unfortunately, pdftotext can extract some text even from a PDF that contains only images, so you will have to do some more work to decide whether a PDF needs OCR.
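
One way to do that extra work is to count only letters and digits in the extracted text and compare the count against a threshold, so that stray form feeds or a few junk characters do not count as real text. The following is a minimal Python sketch of that idea, not part of the original answer. The function names and the threshold of 50 characters are arbitrary assumptions and would need tuning on a sample of your own files.

```python
import subprocess


def looks_like_it_needs_ocr(extracted_text: str, min_chars: int = 50) -> bool:
    """Return True when the text has fewer than min_chars letters or digits.

    min_chars is a hypothetical threshold, not a value from the thread.
    """
    meaningful = sum(1 for ch in extracted_text if ch.isalnum())
    return meaningful < min_chars


def extract_text(pdf_path: str, pdftotext: str = "pdftotext") -> str:
    """Run pdftotext on one file and return what it writes to stdout.

    Assumes a pdftotext binary (Xpdf or Poppler) is on PATH; "-" means stdout.
    """
    result = subprocess.run(
        [pdftotext, "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=False,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

A file would then be flagged with `looks_like_it_needs_ocr(extract_text(path))`.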

じ违心 2024-12-16 05:04:07

XPDF worked for me in a different way, though I am not sure it is the right way.

My image-only PDFs also produced text content, so I used pdffonts.exe to check whether fonts are embedded in the document. In my case, every image-only file showed 'no' in the emb (embedded) column:

> Config Error: No display font for 'Symbol'
> Config Error: No display font for 'ZapfDingbats'
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- ---------
> Helvetica                            Type 1            no  no  no       7  0

whereas every searchable PDF showed 'yes':

> Config Error: No display font for 'Symbol'
> Config Error: No display font for 'ZapfDingbats'
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- ---------
> ABCDEE+Calibri                       TrueType          yes yes no       7  0
> ABCDEE+Calibri,Bold                  TrueType          yes yes no       9  0
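
The check described above can be automated by parsing the pdffonts table. The sketch below is an illustration in Python, not the poster's own code. It reads each row from the right, because the type column can contain spaces ("Type 1"), and it assumes font names contain no spaces and that the last five fields are always emb, sub, uni and the two-part object ID. The function names are hypothetical. Note that the emb heuristic is one poster's observation on their own files and may not hold for every collection.

```python
from typing import List, Tuple


def embedded_flags(pdffonts_output: str) -> List[Tuple[str, bool]]:
    """Return (font name, is_embedded) for each row of a pdffonts table."""
    fonts = []
    in_table = False
    for line in pdffonts_output.splitlines():
        if line.startswith("---"):
            # The dashed separator marks the start of the data rows.
            in_table = True
            continue
        if not in_table:
            continue
        tokens = line.split()
        if len(tokens) < 7:
            continue  # not a complete font row
        # Counting from the right: gen, obj, uni, sub, emb.
        fonts.append((tokens[0], tokens[-5] == "yes"))
    return fonts


def has_embedded_font(pdffonts_output: str) -> bool:
    """True when at least one listed font is embedded."""
    return any(embedded for _, embedded in embedded_flags(pdffonts_output))
```

Under the poster's heuristic, a file for which `has_embedded_font` returns False would be a candidate for OCR.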

转身泪倾城 2024-12-16 05:04:07

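I found that TotalCmd has a plugin that handles this:
https://totalcmd.net/plugring/pdfOCR.html

pdfOCR is a wdx plugin that reports how many pages of each PDF file in the current directory need character recognition (OCR), that is, how many pages have no searchable text in their layout. This is mostly needed when preparing PDF files for a documentation or archiving system. At work, PDF files often have to be converted from scanned versions into a text-searchable form before they are added to any documentation, so that manual or automatic text search is possible. The pdfOCR plugin for Total Commander addresses this librarian's need by reporting the number of image-only pages that contain no text. The number of scanned pages is shown in the "needOCR" column. By comparing the needOCR page count with the total number of pages, you can decide whether a PDF file needs additional OCR processing.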
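
The same per-page comparison can be approximated outside Total Commander. The Python sketch below is not the plugin's implementation. It assumes the input is pdftotext output in its default mode, where each page is terminated by a form feed character, and the function name is hypothetical.

```python
from typing import Tuple


def count_pages_without_text(pdftotext_output: str) -> Tuple[int, int]:
    """Return (pages with no text, total pages) for one file's pdftotext output."""
    pages = pdftotext_output.split("\f")
    # Each page ends with a form feed, so the final piece after the last
    # form feed is an empty leftover rather than a page.
    if pages and pages[-1].strip() == "":
        pages.pop()
    need_ocr = sum(1 for page in pages if not page.strip())
    return need_ocr, len(pages)
```

A file where the first number equals the second has no text on any page, while a smaller non-zero count points to a mixed document in which only some pages were scanned.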

叹梦 2024-12-16 05:04:07

The following script recursively finds files that need OCR. You'll need to get pdftotext from your favorite source; I used Cygwin to install it.

#!/bin/bash
# Recursively print every PDF from which pdftotext extracts no non-whitespace text.
find . -name "*.pdf" -print0 | while IFS= read -r -d '' file; do
    if [ -z "$(pdftotext "$file" - | tr -d '[:space:]')" ]; then
        echo "$file"
    fi
done

I used the following script to move files that need OCR into a subfolder so that I could run batch OCR from Acrobat. You could instead run OCR directly with the command-line tool of your choice.

#!/bin/bash
# Move every PDF in the current folder that has no extractable text into ./ocr.
mkdir -p ocr
for file in *.pdf; do
    echo "$file"
    if [ -z "$(pdftotext "$file" - | tr -d '[:space:]')" ]; then
        mv -- "$file" ocr/
    fi
done
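
For readers who cannot run bash, the move step can also be sketched in Python. This is an illustrative port, not the poster's code. The decision of whether a file needs OCR is passed in as a function, so any of the checks discussed in this thread can be plugged in. The name `move_pdfs_needing_ocr` and its parameters are hypothetical.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def move_pdfs_needing_ocr(
    folder: str,
    needs_ocr: Callable[[Path], bool],
    subfolder: str = "ocr",
) -> List[Path]:
    """Move each PDF in folder for which needs_ocr(path) is true into folder/subfolder."""
    base = Path(folder)
    target = base / subfolder
    moved = []
    # sorted() materializes the listing before any file is moved.
    for pdf in sorted(base.glob("*.pdf")):
        if needs_ocr(pdf):
            target.mkdir(exist_ok=True)
            destination = target / pdf.name
            shutil.move(str(pdf), str(destination))
            moved.append(destination)
    return moved
```

Like the bash version, this only looks at the top level of the folder and does not recurse.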

美煞众生 2024-12-16 05:04:07

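You can use the desktop search tool dtSearch to scan a folder or an entire drive. At the end of the scan, it shows a list of all "image only" PDFs. It also shows a list of "encrypted" PDFs, if there are any.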
