Reliably extracting identity fields from scanned documents/images?
I have to pull two pre-printed (not hand-written) fields out of a paper form so that the form can be routed automatically after it is scanned. The fields contain batch and item identifiers, like "GG-9192" or "EPN/245G".
I've tried the following software:
- Tesseract-OCR
- Cuneiform
- Canon ImageRunner built-in OCR
- Asprise OCR Java API (demo)
I've tried the following settings:
- Scanning at resolutions of 300dpi and 600dpi
- Tried different fonts, including OCR-A and OCR-B.
In all cases output was pretty much all over the place. I can kick back documents for which I can't properly extract the necessary information, but I'm thinking it's going to be at least half of them. I considered some sort of fuzzy logic based on known values in a database, but sometimes these identifiers can differ by a single character, like "123G" and "123C".
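To make the fuzzy-matching idea and its weakness concrete, here is a minimal sketch, not a tested production approach. It assumes the known identifiers are available as an in-memory list with no duplicates, and the names `IdentifierMatcher`, `resolve`, and `maxDistance` are hypothetical. An OCR result is accepted only when exactly one known identifier is closest to it and lies within the allowed edit distance; anything ambiguous gets kicked back.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class IdentifierMatcher {

    // Levenshtein edit distance: minimum number of single-character
    // insertions, deletions, or substitutions turning a into b.
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the single closest known identifier, or empty when the best
    // match is too far away or tied with another candidate (assumption:
    // "known" holds distinct values, e.g. loaded from the database).
    public static Optional<String> resolve(String ocrText, List<String> known, int maxDistance) {
        int best = Integer.MAX_VALUE;
        List<String> closest = new ArrayList<>();
        for (String candidate : known) {
            int d = editDistance(ocrText, candidate);
            if (d < best) {
                best = d;
                closest.clear();
                closest.add(candidate);
            } else if (d == best) {
                closest.add(candidate);
            }
        }
        boolean unique = closest.size() == 1 && best <= maxDistance;
        return unique ? Optional.of(closest.get(0)) : Optional.empty();
    }

    public static void main(String[] args) {
        List<String> known = List.of("GG-9192", "EPN/245G", "123G", "123C");
        System.out.println(resolve("GG-9I92", known, 1)); // Optional[GG-9192]
        System.out.println(resolve("1236", known, 1));    // Optional.empty
    }
}
```

The second call shows the problem: a misread "1236" is one edit away from both "123G" and "123C", so the match is ambiguous and the document still has to be rejected.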
Is this a lost cause? Perhaps OCR just isn't mature enough to handle a requirement of this nature? What other techniques might you recommend? Barcodes?
Edit: the containing application is written in Java, so recommendations that come with free or cheap Java-based APIs would help.
Edit 2: if anyone is interested... without any special tuning, Cuneiform for Linux and the Canon ImageRunner worked best, with Tesseract-OCR and the Asprise Java API producing the worst results. None of the four was acceptable for anything beyond standard document-search-grade OCR. I'm beginning to think that this isn't going to work out.
Comments (2)
If you have control over the fields, why use a human-readable format in the first place? For scanning, a QR Code or something similar seems like the best fit. It has orientation markers and some built-in error correction.
http://en.wikipedia.org/wiki/QR_Code
I started digging for products starting with Tomato's suggestion. I tried ABBYY and CVISION. Both have products that can automate OCR:
In addition, ABBYY has SDKs for various platforms, and CVISION has an SDK that appears to work with at least VB/VC++.
I haven't tried either SDK yet, and am not sure it's necessary for my project. All I need is PDFs coming in that I can extract the text from. I did however try CVISION's server product and with the OCR on its most accurate settings, it worked really well. I haven't tried ABBYY's server product yet because I have to go through a reseller to get a trial. I'm in the process of doing so, but if it starts getting annoying I'm probably going to go with CVISION. I did try ABBYY's FineReader standalone product, and it worked very well, so I assume that their server product would also.