How do I identify PDF files that need OCR?

Posted on 2024-12-09 05:04:07

I have over 30,000 PDF files. Some have already been OCR'd and some have not. Is there a way to find out which files are already OCR'd and which PDFs are image-only?

It will take forever if I run every single file through an OCR processor.

Comments (5)

我一直都在从未离去 2024-12-16 05:04:07

I would write a small script to extract the text from the PDF files and see if it is "empty". If there is text, the PDF has already been OCRed. You could use either Ghostscript or Xpdf to extract the text.

EDIT:
This should get you started:

foreach ($pdffile in Get-ChildItem -Filter *.pdf) {
    # Call pdftotext directly; the trailing '-' sends the extracted text to stdout
    $pdftext = (& "\path\to\xpdf\pdftotext.exe" $pdffile.FullName '-') -join "`n"
    Write-Host $pdffile.FullName
    Write-Host $pdftext.Length   # character count of the extracted text
    Write-Host $pdftext
    Write-Host "-------------------------------"
}

Unfortunately, even when a PDF contains only images, pdftotext will still extract some text, so a plain length check is not enough; you will have to do some more work to decide whether the PDF needs OCR.
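
One way to do that extra work is to ignore whitespace and page-break characters and require a minimum amount of real text before treating a file as already OCRed. This is only a rough sketch, not the answerer's method: the function name has_real_text and the default threshold of 20 bytes are hypothetical and should be tuned on a sample of your own files.

```shell
# has_real_text [MIN_BYTES]
# Reads extracted text on stdin and succeeds (exit 0) when it contains at
# least MIN_BYTES bytes that are not whitespace.
# MIN_BYTES is a hypothetical threshold (default 20), not a pdftotext option.
has_real_text() {
    min_bytes="${1:-20}"
    # tr removes spaces, newlines and the form feeds that separate pages;
    # the arithmetic expansion strips any padding that wc adds to its output.
    count=$(( $(tr -d '[:space:]' | wc -c) ))
    [ "$count" -ge "$min_bytes" ]
}

# Usage (assumes pdftotext is on PATH):
#   pdftotext scan.pdf - | has_real_text 20 || echo "scan.pdf needs OCR"
```

A byte threshold is crude: a scan with a short stamped header could still pass, so it may be worth spot-checking files that land close to the limit.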

じ违心 2024-12-16 05:04:07

Xpdf worked for me in a different way, though I am not sure it is the right approach.

My image-only PDFs also produced some text content, so I used pdffonts.exe to check whether fonts are embedded in the document. In my case, all image-only files showed 'no' in the emb (embedded) column:

> Config Error: No display font for 'Symbol' 
> Config Error: No display font for 'ZapfDingbats' 
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- --------- 
> Helvetica                            Type 1            no  no  no       7  0

whereas all searchable PDFs showed 'yes':

> Config Error: No display font for 'Symbol'
> Config Error: No display font for 'ZapfDingbats'
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- ---------
> ABCDEE+Calibri                       TrueType          yes yes no       7  0
> ABCDEE+Calibri,Bold                  TrueType          yes yes no       9  0
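
To apply the same check across many files, the emb column can be read with a small filter. This is a sketch under assumptions: count_embedded_fonts is a made-up helper name, it expects the raw pdffonts table (without the "> " quoting shown above), and it relies on emb being the fifth column from the right, which holds for the output shown here. Whether "no embedded fonts" really means "needs OCR" for your files is the observation from this answer, not a guarantee.

```shell
# count_embedded_fonts
# Reads pdffonts output on stdin and prints how many listed fonts have
# "yes" in the emb column. Lines before the dashed rule (header, config
# warnings) are skipped.
count_embedded_fonts() {
    awk '
        /^-+ / { in_table = 1; next }   # font rows start after the dashed rule
        # emb is counted from the right because "Type 1" contains a space
        in_table && NF >= 5 && $(NF-4) == "yes" { n++ }
        END { print n + 0 }
    '
}

# Usage (assumes pdffonts is on PATH):
#   if [ "$(pdffonts file.pdf 2>/dev/null | count_embedded_fonts)" -eq 0 ]; then
#       echo "file.pdf has no embedded fonts"
#   fi
```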

转身泪倾城 2024-12-16 05:04:07

I found that TotalCmd has a plugin that handles this:
https://totalcmd.net/plugring/pdfOCR.html

pdfOCR is a wdx plugin that discovers how many pages of the PDF files in
the current directory need character recognition (OCR), i.e. how many
pages in a PDF file have no searchable text in their layout. This is
mostly needed when one is preparing PDF files for one’s documentation
or archiving system. Generally, PDF files need
to be transformed from scanned version to text-searchable form before
they are included in any documentation, to allow for manual or
automatic text search. The pdfOCR plugin for Total Commander fulfils a
librarian’s need by presenting the number of pages that are images
only, with no text contained. The number of scanned pages is presented
in the column “needOCR”. By comparing the needOCR number of pages with
the total number of pages, one can decide whether a PDF file needs
additional OCR processing.
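
For anyone without Total Commander, a similar per-page count can be approximated with the Xpdf/Poppler command-line tools. This is a rough sketch and not how the plugin itself works: count_pages_without_text is a hypothetical helper, it assumes pdfinfo and pdftotext are on PATH, and it treats a page as needing OCR when pdftotext returns nothing but whitespace for it.

```shell
# count_pages_without_text FILE
# Prints "<pages without text> <total pages>" for one PDF.
count_pages_without_text() {
    pdf="$1"
    # pdfinfo prints a line such as "Pages:          12"
    pages=$(pdfinfo "$pdf" | awk '/^Pages:/ { print $2 }')
    pages="${pages:-0}"
    need=0
    page=1
    while [ "$page" -le "$pages" ]; do
        # -f and -l restrict pdftotext to a single page; "-" writes to stdout
        text=$(pdftotext -f "$page" -l "$page" "$pdf" - | tr -d '[:space:]')
        if [ -z "$text" ]; then
            need=$((need + 1))
        fi
        page=$((page + 1))
    done
    echo "$need $pages"
}

# Usage:
#   count_pages_without_text report.pdf
```

Because this runs pdftotext once per page, it is slow on large collections; it is probably best kept for files that a quicker whole-file check has already flagged as doubtful.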

叹梦 2024-12-16 05:04:07

The following script will recursively find files that need OCR. You'll need to get pdftotext from your favorite source. I used Cygwin to install it.

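#!/bin/bash
# Recursively list PDFs from which pdftotext extracts no text.
# -print0 / read -d '' keep file names with spaces or newlines intact.
find . -name "*.pdf" -print0 | while IFS= read -r -d '' file; do
    # Strip all whitespace (including page-break form feeds) before testing
    if [ -z "$(pdftotext "$file" - | tr -d '[:space:]')" ]; then
        echo "$file"
    fi
done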

I used the following script to move files that need OCR into a subfolder so that I can perform batch OCR from Acrobat. You could instead run OCR directly with the command-line tool of your choice.

#!/bin/bash
# Move PDFs with no extractable text into ./ocr for batch OCR
mkdir -p ocr
for file in *.pdf; do
    echo "$file"
    if [ -z "$(pdftotext "$file" - | tr -d '[:space:]')" ]; then
        mv -- "$file" ocr/
    fi
done

美煞众生 2024-12-16 05:04:07

You can scan a folder or an entire drive using the desktop search tool "dtSearch". At the end of the scan, it shows a list of all "image only" PDFs. It also shows a list of "encrypted" PDFs, if there are any.
