Closed 8 years ago.
Just to put some of your misconceptions straight...
" I don't have a licensed copy of acrobat so I don't know how I'd convert 10,000 files to tiff."
You can convert PDFs to TIFF with the help of Free (as in liberty) and free (as in beer) Ghostscript. Your choice if you want to do it on Linux Mint or on Windows 7. The commandline for Linux is:
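(The command itself didn't survive in this copy; a typical Ghostscript invocation for multipage G4 TIFF output, with placeholder filenames, would be:)

```
gs -o output.tif -sDEVICE=tiffg4 input.pdf
```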
"i don't want 10,000 30 page documents turned into 30,000 individual tiff images"
You can have "multipage" TIFFs easily. The above command does create such TIFFs of the G4 (fax tiff) flavor. Should you want single-page TIFFs instead, you can modify the command:
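(The modified command wasn't preserved either; a typical single-page variant, with placeholder filenames, would be:)

```
gs -o page-%03d.tif -sDEVICE=tiffg4 input.pdf
```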
The %03d part of the output filename will automatically translate into a series of 001, 002, 003 etc.
Caveats:
The default resolution of the tiffg4 output device is 204x196 dpi. You probably want a better value. To get 720 dpi you should add -r720x720 to the commandline. The -gXxY parameter sets width x height in device points, so to get ISO A4 output page dimensions in landscape you can add a -g8420x5950 parameter (for portrait, swap the values: -g5950x8420). So the full command which controls these two parameters, to produce 720 dpi output on A4 in portrait orientation, would read:
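(That full command was also lost in this copy; based on the parameters just described, it would look something like the following, with placeholder filenames. Note -g5950x8420 puts the longer dimension on the height, i.e. portrait A4 at 720 dpi:)

```
gs -o out.tif -sDEVICE=tiffg4 -r720x720 -g5950x8420 input.pdf
```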
Figured I would try to contribute by answering my own question (have written some nice code for myself and could not have done it without help from this board). If you cat the pdf files in unix (well, osx for me), then the pdf files that have text will have the word "Font" in them (as a string, but mixed in with other text) b/c that's how the file tells Adobe what fonts to display.
The cat command in bash seems to have the same output as reading the file in binary mode in python (using 'rb' mode when opening the file instead of 'w' or 'r' or 'a'). So I'm assuming that all pdf files that contain text will have the word "Font" in the binary output and that no image-only files ever will. If that's always true, then this code will make a list of all pdf files in a single directory that have text and a separate list of those that have only images. It saves each list to a separate .txt file, then you can use a command in bash to move the pdf files to the appropriate folder.
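(The code itself wasn't preserved in this copy; a minimal sketch of what's described above, with the directory path and output filenames as placeholders, could look like:)

```python
import os

def sort_pdfs_by_font(directory):
    """Sort PDFs into text-bearing vs image-only lists, based on whether
    the raw bytes contain the string 'Font' (the heuristic described above)."""
    has_text, images_only = [], []
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(".pdf"):
            continue
        # 'rb' gives the same bytes you'd see from `cat` in the shell
        with open(os.path.join(directory, name), "rb") as f:
            data = f.read()
        (has_text if b"Font" in data else images_only).append(name)
    # save each list to its own .txt file for the bash move step below
    with open(os.path.join(directory, "has_text.txt"), "w") as f:
        f.write("\n".join(has_text))
    with open(os.path.join(directory, "images_only.txt"), "w") as f:
        f.write("\n".join(images_only))
    return has_text, images_only
```

It inherits the same caveat as the heuristic itself: a mostly-scanned PDF that embeds a font for, say, page numbers will still be classified as having text.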
Once you have them in their own folders, then you can run your batch ocr solution on just the pdf files in the images_only folder. I haven't gotten that far yet (obviously).
To move the files, I entered this command in bash shell:
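(The exact command wasn't preserved; one way to do the move, assuming the .txt lists hold one filename per line. The first three lines below just scaffold a throwaway demo, and the folder name images_only is a placeholder:)

```shell
cd "$(mktemp -d)"                          # demo scaffolding only
touch scan1.pdf scan2.pdf
printf 'scan1.pdf\nscan2.pdf\n' > images_only.txt

# the actual move: read each listed file and move it into its folder
mkdir -p images_only
while IFS= read -r f; do
  mv "$f" images_only/
done < images_only.txt
```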
Also, I didn't re-run the python code above, I just hand-edited the thing, so it might be buggy, Idk.
This is an interesting problem. If you are willing to work on Windows in .NET, you can do this with dotImage (disclaimer, I work for Atalasoft and wrote most of the OCR engine code). Let's break the problem down into pieces - the first is iterating over all your PDFs:
This gets a list of all files that end in .pdf and if the file is a valid pdf, calls a routine to process it:
This opens the file as a Document object and asks if each page is image only. If so it will OCR the page, else it will text extract:
What this does is rasterize the PDF page into an image and put it into a form that is palatable for engine.Translate. This doesn't strictly need to be done this way - one could get an OcrPage object from the engine from an AtalaImage by calling Recognize, but then it would be up to client code to loop over the structure and write out the text.
You'll note that I've left out GetOcrEngine() - we make available 4 OCR engines for client use: Tesseract, GlyphReader, RecoStar, and Iris. You would select the one that would be best for your needs.
Finally, you would need the code to extract text from the pages that already have perfectly good text on them:
This extracts the text from the given page and writes it to the output stream.
Finally, you need GetTextStream():
Will this be a 100% solution? No, certainly not. You could imagine PDF pages that contain a single image with a box drawn around it - this would clearly fail the image-only test but return no useful text. Probably a better approach is to just use the extracted text and, if that doesn't return anything, then try an OCR engine. Changing from one approach to the other is a matter of writing a different predicate.
The simplest approach would be to use a single tool such as ABBYY FineReader, Omnipage etc. to process the images in one batch, without having to sort them into scanned vs not-scanned images. I believe FineReader converts the PDFs to images before performing OCR anyway.
Using an OCR engine will give you features such as automatic deskew, page orientation detection, image thresholding, despeckling etc. These are features you would otherwise have to buy an image processing library for and program yourself, and it could prove difficult to find an optimal set of parameters for your 10,000 PDFs.
Using the automatic OCR approach will have other side effects depending on the input images, and you would find you get better results if you sorted the images and set optimal parameters for each type of image. For accuracy it would be much better to use a proper PDF text-extraction routine on the PDFs that already have perfect text.
At the end of the day it will come down to time and money versus the quality of the results that you need. A commercial OCR program will be the quickest and easiest solution. If you have clean text-only documents then a cheap OCR program will work as well as an expensive solution. The more complex your documents, the more money you will need to spend to process them.
I would try finding some demo / trial versions of commercial OCR engines and just see how they perform on your different document types before spending too much time and money.
I have written a small wrapper for the Abbyy OCR4LINUX CLI engine (which, IMHO, doesn't cost that much) and Tesseract 3.
The wrapper can batch convert files like:
$ pmocr.sh --batch --target=pdf --skip-txt-pdf /some/directory
The script uses pdffonts to determine whether a PDF file has already been OCRed, and skips those that have. Also, the script can work as a system service to monitor a directory and launch an OCR action as soon as a file enters the directory. The script can be found here:
https://github.com/deajan/pmOCR
Hopefully, this helps someone.
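(For reference, the pdffonts check the script relies on can be sketched like this; some.pdf is a placeholder and the actual logic in pmOCR may differ. poppler's pdffonts prints a two-line header and then one line per embedded font, so more than two lines of output suggests the PDF already carries a text layer:)

```
if [ "$(pdffonts some.pdf 2>/dev/null | wc -l)" -gt 2 ]; then
  echo "PDF already has text; skipping OCR"
fi
```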