Performance issues using Tesseract OCR from a Python application

Posted on 2024-10-13 06:43:26

I recently put together an interface for scanning and uploading searchable documents to KnowledgeTree, our document management system. We have access to plenty of separate tools for different parts of this process, but I wanted to combine everything into one interface to keep things simple for the users.

Here's the platform:

#    OS: Ubuntu Desktop 10.04
#    GUI Toolkit: wxPython
#    OCR package: Tesseract 3.00 (compiled executable)

And here is the basic process:

#    1. Retrieve individual page images from scanner
#    2. Call Tesseract OCR executable to produce HOCR data for each page
#    3. Run extracted words against English dictionary to guess if page orientation is correct
#        3a. If word matches are below threshold, rotate page 90 degrees and try again
#    4. Detect document type and retrieve metadata from HOCR data
#    5. Merge scanned pages and HOCR data into a finished PDF
#    6. Upload PDF and attached metadata to document management system through KnowledgeTree's API
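
The orientation check in steps 2 through 3a can be sketched roughly as follows. This is a minimal illustration, not the code from my application: `word_match_ratio`, `find_orientation`, the `ocr` and `rotate` callables, and the 0.5 threshold are all hypothetical names and values chosen for the example. In practice `ocr` would wrap the call to the Tesseract executable and `rotate` would wrap whatever imaging library is in use.

```python
import re


def word_match_ratio(text, dictionary):
    """Return the fraction of alphabetic tokens in `text` found in `dictionary`."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in dictionary)
    return hits / len(words)


def find_orientation(image, ocr, rotate, dictionary, threshold=0.5):
    """Try the page at 0, 90, 180 and 270 degrees.

    `ocr(image)` must return the recognized text and `rotate(image, angle)`
    must return a rotated copy; both are supplied by the caller (assumed
    helpers, not part of any library).

    Returns (angle, text) for the first rotation whose dictionary match
    ratio reaches `threshold`, or for the best-scoring rotation if none does.
    """
    best_angle, best_text, best_ratio = 0, "", -1.0
    for angle in (0, 90, 180, 270):
        text = ocr(rotate(image, angle))
        ratio = word_match_ratio(text, dictionary)
        if ratio >= threshold:
            # Good enough: stop early to avoid further slow OCR passes.
            return angle, text
        if ratio > best_ratio:
            best_angle, best_text, best_ratio = angle, text, ratio
    return best_angle, best_text
```

The early return matters here because every extra rotation costs one more full OCR pass, which is exactly where the time goes on difficult pages.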

It works beautifully, except that step 2 is extremely slow on certain types of documents. It rolls right through basic fixed-width text reports, but throw a few logos, lines, and other unreadable content in there, and it can sometimes spend minutes on a single page. Not to mention the fact that it could repeat that up to 4 times if it tries to reorient it. In comparison, the software packaged with the scanner uses ABBYY OCR, and can crunch 50+ pages in less than a minute, taking care of page layout and text orientation almost perfectly (I realize that's why ABBYY costs money). Unfortunately, using this scanning software is more complex for the users, and only covers steps 1-3 on its own.

My question is whether I should be approaching this differently, maybe by separating the OCR/upload from the scanning interface completely, or if there are any OCR packages or other solutions I'm overlooking that could be integrated into a Python application. Would the fact that I'm calling an external application to do the work cause performance issues?
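
One way to check whether the external call itself is the bottleneck is to time a trivial child process and compare that with the time for a real page. The sketch below assumes Python 3; `time_command` is a hypothetical helper, and the commented-out Tesseract invocation (`tesseract page.tif page hocr`, with placeholder file names) is my assumption about the command line rather than something verified against 3.00.

```python
import subprocess
import sys
import time


def time_command(cmd):
    """Run `cmd` (a list of arguments) and return (elapsed_seconds, return_code)."""
    start = time.perf_counter()
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return time.perf_counter() - start, result.returncode


if __name__ == "__main__":
    # Baseline: the cost of starting and stopping a trivial child process.
    baseline, _ = time_command([sys.executable, "-c", "pass"])
    print("process spawn baseline: %.3f s" % baseline)

    # Assumed invocation for one page; uncomment where Tesseract is installed
    # and the placeholder file exists.
    # ocr_time, rc = time_command(["tesseract", "page.tif", "page", "hocr"])
    # print("OCR for one page: %.3f s (exit code %d)" % (ocr_time, rc))
```

If the baseline is a fraction of a second while a page takes minutes, the time is being spent inside the OCR engine rather than in launching it.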

Whatever I do here, it's important that I have control over step 4, since requiring the users to manually set the type and metadata for each uploaded document could be a problem.
