Software to improve OCR results based on the output of multiple OCR software packages
Is there an already-existing piece of commercial or academic software that can
- overlay results from multiple OCR packages (Abbyy FineReader, Adobe Acrobat Professional, ReadIris, etc.)
- provide fully automated improvements based on accumulated knowledge from multiple sources
- allow the use of additional external tools set up at runtime (dictionaries, batch web / local corpus look-ups, etc.)
?
Note: I already have in-house solutions to visualize results from single sources, so if no such software is available, I would not mind developing my own :) Inquiries about cooperation would then also be most welcome!
(source: sourceforge.net)
Comments (2)
The idea of voting between several OCR engines is not new. The problem is that it does not really work. It would probably work if the engines were simple classifiers that were orthogonal by nature: you could then combine their votes and improve the results. But they are all very complicated pieces of software that use quite similar sets of well-known approaches with little variance; they just combine them in different ways, and some implementations are better and some are worse.
Experience shows that when you combine several OCR technologies, the best decision rule is to rely on the results of the most accurate one and just ignore the others. From my experience (I work for ABBYY), ABBYY OCR is definitely the most accurate of the ones you mentioned.
As far as I know, the only reason to use voting is when you want to cross-check "suspicious" characters and send them to manual verification because 100% accuracy is a requirement. With this approach you increase the number of characters to verify, but reduce the chance of missing a wrong character.
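The cross-checking idea could be sketched roughly as follows. This is only an illustration, not how any of the products mentioned works: it assumes each engine's output is already available as a plain string, treats the first string as the most accurate engine, and uses Python's standard `difflib` to align the others against it. The function names `flag_suspicious` and `render` are made up for this example.

```python
from difflib import SequenceMatcher


def flag_suspicious(primary: str, others: list[str]) -> list[bool]:
    """Return one flag per character of `primary`.

    A flag is True when at least one other engine disagrees at that position.
    Assumption: `primary` is the output of the most trusted engine.
    """
    flags = [False] * len(primary)
    for other in others:
        matcher = SequenceMatcher(None, primary, other, autojunk=False)
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
            if tag == "equal":
                continue
            if i1 == i2:
                # The other engine saw extra characters here; flag the neighbours.
                for k in (i1 - 1, i1):
                    if 0 <= k < len(primary):
                        flags[k] = True
            else:
                # Replaced or missing characters: flag the whole span.
                for k in range(i1, i2):
                    flags[k] = True
    return flags


def render(primary: str, flags: list[bool]) -> str:
    """Mark flagged characters with brackets for a human verifier."""
    return "".join(f"[{c}]" if f else c for c, f in zip(primary, flags))


if __name__ == "__main__":
    primary = "The quick brown fox"
    others = ["The quick brovvn fox", "The qu1ck brown fox"]
    print(render(primary, flag_suspicious(primary, others)))
```

The text of the primary engine is never changed here; the other engines only decide which characters get sent to manual verification, which matches the trade-off described above (more characters to check, fewer missed errors).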
There are two options that I have worked with previously and would recommend.
It is a commercial offering that uses multiple OCR engines and voting to determine the best result. It is machine print only. Last time I used it they had 6 engines. Contact Alex Dahl.
I have used it in a major project scanning 20,000+ pages per day.
RecoStar uses voting and can handle both handprint and machine print.
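For readers unfamiliar with the term, voting in its simplest form might look like the sketch below. How the commercial products above actually vote is not described here, so this is purely a toy version: it assumes the engine outputs are already aligned to the same length, and it breaks ties in favour of the engine listed first. The name `majority_vote` is hypothetical.

```python
from collections import Counter


def majority_vote(readings: list[str]) -> str:
    """Pick the most common character at each position.

    Assumptions: all readings are pre-aligned and equally long, and they are
    ordered from most to least trusted so ties go to the earlier engine.
    """
    if not readings:
        raise ValueError("need at least one reading")
    if len({len(r) for r in readings}) != 1:
        raise ValueError("readings must be aligned to the same length")

    result = []
    for column in zip(*readings):
        counts = Counter(column)
        best = max(counts.values())
        # First character in engine order that reaches the top count.
        winner = next(c for c in column if counts[c] == best)
        result.append(winner)
    return "".join(result)


if __name__ == "__main__":
    print(majority_vote(["c1ean", "clean", "clcan"]))
```

Real engine outputs rarely line up character for character, so a practical system would need an alignment step first and would probably weight votes by per-character confidence.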