Batch OCR Program for PDFs

Comments (5)

独﹏钓一江月 2024-11-15 11:38:09

Just to put some of your misconceptions straight...

" I don't have a licensed copy of acrobat so I don't know how I'd convert 10,000 files to tiff."

You can convert PDFs to TIFF with the help of Free (as in liberty) and free (as in beer) Ghostscript. It's your choice whether you do it on Linux Mint or on Windows 7. The command line for Linux is:

gs \
 -o input.tif \
 -sDEVICE=tiffg4 \
  input.pdf

"i don't want 10,000 30 page documents turned into 30,000 individual tiff images"

You can have "multipage" TIFFs easily. Above command does create such TIFFs of the G4 (fax tiff) flavor. Should you even want single-page TIFFs instead, you can modify the command:

gs \
 -o input_page_%03d.tif \
 -sDEVICE=tiffg4 \
  input.pdf

The %03d part of the output filename will automatically translate into a series of 001, 002, 003 etc.

Caveats:

  1. The default resolution for the tiffg4 output device is 204x196 dpi. You probably want a better value. To get 720 dpi, add -r720x720 to the command line.
  2. Also, if your Ghostscript installation uses letter as its default media size, you may want to change it. You can use -gXxY to set width x height in device pixels, so the right values depend on the resolution you set. At 720 dpi, adding a -g8420x5950 parameter gives you ISO A4 output page dimensions in landscape.

So the full command which controls these two parameters, to produce 720 dpi output on A4 in portrait orientation, would read:

gs \
 -o input.tif \
 -sDEVICE=tiffg4 \
 -r720x720 \
 -g5950x8420 \
  input.pdf
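
Since the original problem involves 10,000 files, the same command scales up with a plain shell loop. A minimal sketch, assuming the PDFs sit in /some/directory and that output names derived from the input names are acceptable; ${f%.pdf} strips the extension so each input.pdf becomes input.tif:

# Sketch: batch-convert every PDF in a directory to a multipage
# 720 dpi A4 G4 TIFF, using exactly the command shown above.
for f in /some/directory/*.pdf; do
    gs \
     -o "${f%.pdf}.tif" \
     -sDEVICE=tiffg4 \
     -r720x720 \
     -g5950x8420 \
     "$f"
done
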
野生奥特曼 2024-11-15 11:38:09

Figured I would try to contribute by answering my own question (I have written some nice code for myself and could not have done it without help from this board). If you cat the PDF files in Unix (well, OS X for me), the PDF files that contain text will have the word "Font" in them (as a string, mixed in with other data), because that's how the file tells Adobe which fonts to display.

The cat command in bash seems to have the same output as reading the file in binary mode in Python (using 'rb' mode when opening the file instead of 'w' or 'r' or 'a'). So I'm assuming that all PDF files that contain text will have the word "Font" in the binary output and that no image-only file ever will. If that's always true, then this code will make a list of all PDF files in a single directory that have text and a separate list of those that have only images. It saves each list to a separate .txt file, and then you can use a command in bash to move the PDF files to the appropriate folder.

Once you have them in their own folders, you can run your batch OCR solution on just the PDF files in the image_only folder. I haven't gotten that far yet (obviously).

    import os, re

    # path is the directory with the files; the other two are the names of
    # the files you will store your lists in
    path = 'C:/folder_with_pdfs'
    files_with_text = open('files_with_text.txt', 'a')
    image_only_files = open('image_only_files.txt', 'a')

    # have os make a list of all files in that dir for a loop
    filelist = os.listdir(path)

    # compile a regular expression that matches "Font" as bytes,
    # since the files are read in binary mode
    mysearch = re.compile(rb'Font')

    # loop over all files in the directory, open each in binary ('rb'),
    # and search the raw bytes for "Font" -- if it's there, the PDF has
    # text; if not, it's assumed to be image-only
    # (a PDF declares the fonts it uses, so "Font" appears whenever
    # the file contains text)
    for pdf in filelist:
        openable_file = os.path.join(path, pdf)
        with open(openable_file, 'rb') as cat_file:
            usable_cat_file = cat_file.read()
        if mysearch.search(usable_cat_file):
            files_with_text.write(pdf + '\n')
        else:
            image_only_files.write(pdf + '\n')

    files_with_text.close()
    image_only_files.close()

To move the files, I entered this command in the bash shell:

cat files_with_text.txt | while read -r i; do mv "$i" /Volumes/hard_drive_name/new_destination_directory_name; done

Also, I didn't re-run the Python code above; I just hand-edited it, so it might be buggy, I don't know.
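
For what it's worth, the same "Font" heuristic can be run in a single pass straight from the shell. A minimal sketch, assuming the PDFs sit in /some/directory: -a treats the binary PDFs as text, -l lists the files that match, and -L lists the files that don't:

# Sketch: split PDFs into text-bearing and image-only lists via grep.
grep -al 'Font' /some/directory/*.pdf > files_with_text.txt
grep -aL 'Font' /some/directory/*.pdf > image_only_files.txt

Note that these lists will contain full paths rather than bare file names, which is actually convenient for the mv loop above.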

梦萦几度 2024-11-15 11:38:09

This is an interesting problem. If you are willing to work on Windows in .NET, you can do this with dotImage (disclaimer: I work for Atalasoft and wrote most of the OCR engine code). Let's break the problem down into pieces; the first is iterating over all your PDFs:

string[] candidatePDFs = Directory.GetFiles(sourceDirectory, "*.pdf");
PdfDecoder decoder = new PdfDecoder();

foreach (string path in candidatePDFs) {
    using (FileStream stm = new FileStream(path, FileMode.Open)) {
        if (decoder.IsValidFormat(stm)) {
            ProcessPdf(path, stm);
        }
    }
}

This gets a list of all files that end in .pdf and, if a file is a valid PDF, calls a routine to process it:

public void ProcessPdf(string path, Stream stm)
{
    using (Document doc = new Document(stm)) {
        int i=0;
        foreach (Page p in doc.Pages) {
            if (p.SingleImageOnly) {
                ProcessWithOcr(path, stm, i);
            }
            else {
                ProcessWithTextExtract(path, stm, i);
            }
            i++;
        }
    }
}

This opens the file as a Document object and asks whether each page is image-only. If so, it OCRs the page; otherwise it extracts the text:

public void ProcessWithOcr(string path, Stream pdfStm, int page)
{
    using (Stream textStream = GetTextStream(path, page)) {
        PdfDecoder decoder = new PdfDecoder();
        using (AtalaImage image = decoder.Read(pdfStm, page)) {
            ImageCollection coll = new ImageCollection();
            coll.Add(image);
            ImageCollectionImageSource source = new ImageCollectionImageSource(coll);
            OcrEngine engine = GetOcrEngine();
            engine.Initialize();
            engine.Translate(source, "text/plain", textStream);
            engine.Shutdown();
        }
    }
}

What this does is rasterize the PDF page into an image and put it into a form that is palatable to engine.Translate. This doesn't strictly need to be done this way; one could get an OcrPage object from the engine by calling Recognize on an AtalaImage, but then it would be up to client code to loop over the structure and write out the text.

You'll note that I've left out GetOcrEngine(); we make four OCR engines available for client use: Tesseract, GlyphReader, RecoStar, and Iris. You would select the one that best fits your needs.

Finally, you would need the code to extract text from the pages that already have perfectly good text on them:

public void ProcessWithTextExtract(string path, Stream pdfStream, int page)
{
    using (Stream textStream = GetTextStream(path, page)) {
        using (StreamWriter writer = new StreamWriter(textStream)) {
            using (PdfTextDocument doc = new PdfTextDocument(pdfStream)) {
                PdfTextPage textPage = doc.GetPage(page);
                writer.Write(textPage.GetText(0, textPage.CharCount));
            }
        }
    }
}

This extracts the text from the given page and writes it to the output stream.

Finally, you need GetTextStream():

public Stream GetTextStream(string sourcePath, int pageNo)
{
    string dir = Path.GetDirectoryName(sourcePath);
    string fname = Path.GetFileNameWithoutExtension(sourcePath);
    string finalPath = Path.Combine(dir, String.Format("{0}p{1}.txt", fname, pageNo));
    return new FileStream(finalPath, FileMode.Create);
}

Will this be a 100% solution? No, certainly not. You could imagine PDF pages that contain a single image with a box drawn around it; this would clearly fail the image-only test but return no useful text. Probably a better approach is to just use the extracted text and, if that doesn't return anything, then try an OCR engine. Changing from one approach to the other is a matter of writing a different predicate (see the sketch below).
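
A minimal sketch of that text-first predicate, reusing the routines from above; note that the PageCount property and the IsNullOrWhiteSpace test are assumptions for illustration, not confirmed dotImage API:

// Sketch only: prefer extracted text, fall back to OCR when a page
// yields nothing useful. PageCount and the whitespace test are
// assumptions layered on the types shown earlier in this answer.
public void ProcessPdfTextFirst(string path, Stream stm)
{
    using (PdfTextDocument doc = new PdfTextDocument(stm)) {
        for (int i = 0; i < doc.PageCount; i++) {
            PdfTextPage textPage = doc.GetPage(i);
            string text = textPage.GetText(0, textPage.CharCount);
            if (String.IsNullOrWhiteSpace(text))
                ProcessWithOcr(path, stm, i);         // no usable text: OCR it
            else
                ProcessWithTextExtract(path, stm, i); // real text: extract it
        }
    }
}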

坏尐絯℡ 2024-11-15 11:38:09

The simplest approach would be to use a single tool such as ABBYY FineReader, OmniPage, etc., to process the images in one batch without having to sort them into scanned vs. not-scanned images. I believe FineReader converts the PDFs to images before performing OCR anyway.

Using an OCR engine will give you features such as automatic deskew, page orientation detection, image thresholding, despeckling, etc. These are features you would otherwise have to buy an image processing library for and program yourself, and it could prove difficult to find an optimal set of parameters for your 10,000 PDFs.

Using the automatic OCR approach will have other side effects depending on the input images, and you would get better results if you sorted the images and set optimal parameters for each type of image. For accuracy, it would be much better to use a proper PDF text-extraction routine on the PDFs that already have perfect text.

In the end it will come down to time and money versus the quality of the results that you need. A commercial OCR program will be the quickest and easiest solution. If you have clean, text-only documents, then a cheap OCR program will work as well as an expensive solution. The more complex your documents, the more money you will need to spend to process them.

I would try finding some demo / trial versions of commercial OCR engines and just see how they perform on your different document types before spending too much time and money.

剑心龙吟 2024-11-15 11:38:09

I have written a small wrapper for the Abbyy OCR4LINUX CLI engine (which, IMHO, doesn't cost that much) and Tesseract 3.

The wrapper can batch convert files like:
$ pmocr.sh --batch --target=pdf --skip-txt-pdf /some/directory

The script uses pdffonts to determine whether a PDF file has already been OCRed, and skips those that have. The script can also run as a system service, monitoring a directory and launching an OCR action as soon as a file enters it.

The script can be found here:
https://github.com/deajan/pmOCR
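
For reference, the pdffonts check that the script relies on is easy to reproduce by hand. A minimal sketch, assuming poppler-utils is installed and the PDFs sit in /some/directory: pdffonts prints a two-line header, so more than two lines of output means the PDF already carries embedded fonts (i.e., text):

# Sketch: flag PDFs that still need OCR based on pdffonts output.
for f in /some/directory/*.pdf; do
    if [ "$(pdffonts "$f" | wc -l)" -gt 2 ]; then
        echo "skip (already has text): $f"
    else
        echo "needs OCR: $f"
    fi
done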

Hopefully, this helps someone.
