Performance issues using Tesseract OCR from a Python application
I recently put together an interface for scanning and uploading searchable documents to KnowledgeTree, our document management system. We have access to plenty of separate tools for different parts of this process, but I wanted to combine everything into one interface to keep things simple for the users.
Here's the platform:
# OS: Ubuntu Desktop 10.04
# GUI Toolkit: wxPython
# OCR package: Tesseract 3.00 (compiled executable)
And here is the basic process:
# 1. Retrieve individual page images from scanner
# 2. Call Tesseract OCR executable to produce HOCR data for each page
# 3. Run extracted words against English dictionary to guess if page orientation is correct
# 3a. If word matches are below threshold, rotate page 90 degrees and try again
# 4. Detect document type and retrieve metadata from HOCR data
# 5. Merge scanned pages and HOCR data into a finished PDF
# 6. Upload PDF and attached metadata to document management system through KnowledgeTree's API
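For reference, steps 2 and 3 can be sketched roughly as below. This is a minimal sketch, not the actual application code: the function names, the regex-based word extraction, and the 0.5 threshold are assumptions made for illustration. It assumes `tesseract` is on the PATH and that passing the `hocr` config name after the output base turns on hOCR output, as it does in Tesseract 3.x. The rotation retry of step 3a is left out.

```python
import html
import re
import subprocess

TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"[A-Za-z]+")


def build_tesseract_command(image_path, output_base):
    """Build the command line for step 2 (hOCR output for one page)."""
    return ["tesseract", image_path, output_base, "hocr"]


def run_tesseract(image_path, output_base):
    """Run the external executable and raise if it exits with an error.

    The extension of the file it writes differs between Tesseract
    versions, so check what your build produces before reading it back.
    """
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)


def extract_words(hocr_text):
    """Strip markup from hOCR text and return lowercase alphabetic words."""
    plain = html.unescape(TAG_RE.sub(" ", hocr_text))
    return [word.lower() for word in WORD_RE.findall(plain)]


def match_ratio(words, dictionary):
    """Fraction of extracted words that appear in the dictionary."""
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in dictionary)
    return hits / len(words)


def looks_upright(hocr_text, dictionary, threshold=0.5):
    """Step 3: guess that the orientation is right if enough words match.

    The threshold is a hypothetical value, not one taken from the app.
    """
    return match_ratio(extract_words(hocr_text), dictionary) >= threshold
```

With a dictionary loaded into a `set`, the lookup cost is negligible next to the OCR call itself, so any slowness here comes almost entirely from step 2.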
It works beautifully, except that step 2 is extremely slow on certain types of documents. It rolls right through basic fixed-width text reports, but throw a few logos, lines, and other unreadable content in there, and it can sometimes spend minutes on a single page. Not to mention the fact that it could repeat that up to 4 times if it tries to reorient it. In comparison, the software packaged with the scanner uses ABBYY OCR, and can crunch 50+ pages in less than a minute, taking care of page layout and text orientation almost perfectly (I realize that's why ABBYY costs money). Unfortunately, using this scanning software is more complex for the users, and only covers steps 1-3 on its own.
My question is whether I should be approaching this differently, maybe by separating the OCR/upload from the scanning interface completely, or if there are any OCR packages or other solutions I'm overlooking that could be integrated into a Python application. Would the fact that I'm calling an external application to do the work cause performance issues?
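One way to explore the "separate OCR from scanning" idea without changing OCR engines is to run the per-page calls concurrently, since each page is an independent external process. The sketch below is only an illustration under assumptions: `ocr_pages_in_parallel` and `ocr_page_with_tesseract` are hypothetical names, the worker count is arbitrary, and the `hocr` argument assumes Tesseract 3.x behaviour. Threads are enough here because each worker mostly waits on the external process.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ocr_page_with_tesseract(image_path):
    """Hypothetical per-page worker: run Tesseract and return the output base."""
    output_base = image_path.rsplit(".", 1)[0]
    subprocess.run(["tesseract", image_path, output_base, "hocr"], check=True)
    return output_base


def ocr_pages_in_parallel(page_paths, ocr_page, max_workers=4):
    """Apply ocr_page to every page using a small thread pool.

    Results come back in the same order as page_paths, so they can be
    merged into the PDF without re-sorting.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_page, page_paths))
```

Usage would look like `ocr_pages_in_parallel(paths, ocr_page_with_tesseract)`. This does not make a slow page any faster; it only lets other pages proceed while one is stuck, and keeps the GUI free if the call is made off the main thread.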
Whatever I do here, it's important that I have control over step 4, since requiring the users to manually set the type and metadata for each uploaded document could be a problem.
Comments (1)
The problem you are having is that Tesseract is an OCR engine, not page layout analysis software. The Tesseract website says that version 3.0 will probably include page layout analysis.
I know that in previous versions it only responded well if there was a single column of text.
I think you need to put in a step 1.5 that would do some layout analysis and try to find blocks of images, logos, illegible text.
You might want to look at OCRfeeder to see how it approaches this.
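As a very rough idea of what a "step 1.5" could start from, the sketch below flags regions of a binarized page that are far denser in ink than ordinary text, which is one crude signal for logos or solid graphics. It is not real layout analysis and not how OCRfeeder or Tesseract do it: the function name, the tile size, and the density threshold are all hypothetical, and the page is assumed to already be a list of rows of 0/1 values (1 meaning ink).

```python
def find_dense_tiles(pixels, tile_size=4, max_ink_ratio=0.5):
    """Return (left, top, ink_ratio) for tiles that look too dense to be text.

    pixels is a list of equal-length rows containing 0 (paper) or 1 (ink).
    Tiles at the right and bottom edges may be smaller than tile_size.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    flagged = []
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            rows = pixels[top:top + tile_size]
            cells = [value for row in rows for value in row[left:left + tile_size]]
            ink_ratio = sum(cells) / len(cells)
            if ink_ratio > max_ink_ratio:
                flagged.append((left, top, ink_ratio))
    return flagged
```

Flagged tiles could then be blanked out or cropped away before the page is handed to the OCR engine, so it spends less time trying to read content that is not text.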