How to: improve PDF quality before OCR using C#
I'm creating a service that monitors a folder for scanned files. Once a file arrives, the service picks it up and converts it to a readable PDF. During this process the service also searches for a barcode. After that, the text is extracted, and the file, together with its text, is stored in our software's database. The storage location is based on the barcode.
For the OCR we are using the Atalasoft SDK (http://www.atalasoft.com/). The barcode recognizer is also included in this SDK.
But the converted text still contains some mistakes. (I ran some tests with other OCR programs, but Atalasoft came out well.)
I'm looking for software (an SDK) that lets me improve the quality of the PDF for OCR purposes.
I tested Kofax VRS Elite (http://www.kofax.com/vrs-virtualrescan/). I'm looking for something similar that can be built into the service through an SDK.
Has anyone done this before, or had similar problems?
Thanks in advance!
Comments (1)
You could try a different path altogether:
See if you can configure the scanner(s) to scan directly to PDF and do the OCR on the fly. Lexmark scanners can do this. This creates PDFs with selectable, searchable text, which can in turn be extracted with a PDF reading library.
Alternatively, you may want to have a look at http://www.abbyy.com/ and see if you get better results.
If these are not good options, you may want to break down your problem in a systematic way:
1. Is the image quality of the scanned images the problem? If so, then this will have to be fixed first. Your OCR solution may be affected by resolution, contrast, and colour.
2. Is it the OCR software? Take a highly legible document and see if the OCR software makes mistakes. If so, then you know you have to find better OCR software.
3. If your document quality is decent and your OCR software has a high success rate on legible documents, then you may want to look at the exceptions that do not work, and tackle these on a case-by-case basis.
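For step 2, one way to put a number on "makes mistakes" is a character error rate: the edit distance between the OCR output and a hand-typed reference transcription of the same page, divided by the reference length. The sketch below is plain C# with no OCR SDK involved. The class name `OcrAccuracy` is made up for illustration, and the OCR text is assumed to come from whichever engine you are testing.

```csharp
using System;

// Hypothetical helper, not part of any OCR SDK.
public static class OcrAccuracy
{
    // Levenshtein distance: the minimum number of single-character
    // insertions, deletions and substitutions that turn `a` into `b`.
    public static int EditDistance(string a, string b)
    {
        // Two rolling rows instead of the full (a.Length+1) x (b.Length+1) table.
        int[] prev = new int[b.Length + 1];
        int[] curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int substitution = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.Min(substitution, Math.Min(deletion, insertion));
            }
            // Swap rows so `prev` always holds the most recently completed row.
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.Length];
    }

    // Character error rate: edit distance divided by the reference length.
    // 0.0 is a perfect match; the value can exceed 1.0 when the OCR text
    // is much longer than the reference.
    public static double CharacterErrorRate(string reference, string ocrText)
    {
        if (reference.Length == 0)
            return ocrText.Length == 0 ? 0.0 : 1.0;
        return (double)EditDistance(reference, ocrText) / reference.Length;
    }
}
```

Running the same reference page through the pipeline before and after a change (different scanner settings, different OCR engine, added cleanup) then gives comparable numbers instead of an impression.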
If smears and background images on the documents are the cause of the problem, you may want to look into ways of avoiding them, or of cleaning them up with image processing software that exposes an API.
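As a rough idea of what such a cleanup step does, here is a minimal sketch of global binarization with Otsu's threshold, a common way to separate dark ink from lighter background marks before OCR. It assumes the page is already available as 8-bit grayscale, one byte per pixel; getting pixels out of the PDF or image is SDK-specific and not shown. The names `ScanCleanup`, `OtsuThreshold` and `Binarize` are made up for illustration and are not the API of Atalasoft, Kofax or any other product.

```csharp
using System;

// Hypothetical helper operating on raw 8-bit grayscale pixels.
public static class ScanCleanup
{
    // Otsu's method: choose the grey level that maximises the between-class
    // variance of "dark" (ink) and "light" (paper/background) pixels.
    public static int OtsuThreshold(byte[] gray)
    {
        if (gray.Length == 0)
            throw new ArgumentException("Image has no pixels.", nameof(gray));

        long[] histogram = new long[256];
        foreach (byte p in gray) histogram[p]++;

        long total = gray.Length;
        double sumAll = 0;
        for (int v = 0; v < 256; v++) sumAll += v * (double)histogram[v];

        double sumDark = 0;
        long countDark = 0;
        double bestVariance = -1;
        int bestThreshold = 0;

        for (int t = 0; t < 256; t++)
        {
            countDark += histogram[t];
            if (countDark == 0) continue;      // no dark pixels yet
            long countLight = total - countDark;
            if (countLight == 0) break;        // every pixel is already "dark"

            sumDark += t * (double)histogram[t];
            double meanDark = sumDark / countDark;
            double meanLight = (sumAll - sumDark) / countLight;
            double diff = meanDark - meanLight;
            // Proportional to the between-class variance (constant factor dropped).
            double variance = (double)countDark * countLight * diff * diff;
            if (variance > bestVariance)
            {
                bestVariance = variance;
                bestThreshold = t;
            }
        }
        return bestThreshold;
    }

    // Pixels at or below the threshold become black (0), the rest white (255).
    public static byte[] Binarize(byte[] gray, int threshold)
    {
        byte[] result = new byte[gray.Length];
        for (int i = 0; i < gray.Length; i++)
            result[i] = gray[i] <= threshold ? (byte)0 : (byte)255;
        return result;
    }
}
```

A single global threshold like this handles faint smears and light background images, but it tends to fail on unevenly lit scans or on marks as dark as the text. Those cases usually need local (adaptive) thresholding or the despeckle and background-removal filters that dedicated imaging SDKs provide.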