Tesseract OCR: how to find the read-error magnitude for each returned character?
I'm using the Tesseract OCR engine in an iPhone application to read specific numeric fields from photos of bills and invoices.
With a lot of photo pre-processing (adaptive thresholding, artifact cleaning, etc.) the results are now fairly accurate, but there are still some cases I want to improve.
If the user takes a photo in low-light conditions and there is noise or there are artifacts in the picture, the OCR engine interprets these artifacts as additional digits. In some rare cases it can read, for example, an amount of "32,15" EUR as "5432,15" EUR, which badly damages the end user's confidence in the product.
I assume that, if the OCR engine keeps an internal read error for each character it reads, that error will be higher for the "54" digits in my example, since they were recognized from small noise pixels. If I had access to these read-error values, I could easily discard the erroneous digits.
Do you know of any way to get a read-error magnitude (or any kind of "accuracy factor") for each individual character returned by the Tesseract OCR engine?
Comments (1)
In Tesseract terminology this is called the "confidence" value. Searching for that term in the tesseract-ocr group turns up many answers that mention a TesseractExtractResult method.
The hOCR output also contains this value.
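As a rough illustration of the hOCR route, here is a minimal Python sketch that reads the per-word `x_wconf` property from hOCR markup and drops words below a threshold. Several things in it are assumptions rather than anything from the question: the names `HocrWordParser`, `words_with_confidence` and `keep_confident` are made up for the example, `SAMPLE_HOCR` is hand-written illustrative markup (not real Tesseract output), and the threshold of 60 is arbitrary. Note also that `x_wconf` is a word-level value, so this only helps if the noise digits end up in their own word; for character-level filtering you would have to get the confidences from the API instead.

```python
import re
from html.parser import HTMLParser

# x_wconf is the word-confidence property inside an hOCR title attribute.
CONF_RE = re.compile(r"x_wconf\s+(\d+(?:\.\d+)?)")


class HocrWordParser(HTMLParser):
    """Collects (text, confidence) pairs from ocrx_word spans (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.words = []     # finished (text, confidence) pairs
        self._conf = None   # confidence of the word span currently open, if any
        self._chunks = []   # text fragments seen inside that span
        self._nested = 0    # spans opened inside the current word span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._conf is not None:
            self._nested += 1
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = CONF_RE.search(attrs.get("title") or "")
            if match:
                self._conf = float(match.group(1))
                self._chunks = []

    def handle_data(self, data):
        if self._conf is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or self._conf is None:
            return
        if self._nested:
            self._nested -= 1
            return
        self.words.append(("".join(self._chunks).strip(), self._conf))
        self._conf = None


def words_with_confidence(hocr):
    """Return (text, confidence) for every word span that carries x_wconf."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


def keep_confident(pairs, min_conf=60.0):
    """Keep only the words whose confidence reaches min_conf (arbitrary default)."""
    return [text for text, conf in pairs if conf >= min_conf]


# Hand-written sample for illustration only, not real engine output.
SAMPLE_HOCR = """
<span class='ocr_line' id='line_1_1' title='bbox 10 90 200 120'>
  <span class='ocrx_word' id='word_1_1' title='bbox 10 92 30 116; x_wconf 21'>54</span>
  <span class='ocrx_word' id='word_1_2' title='bbox 36 92 96 116; x_wconf 93'>32,15</span>
</span>
"""

if __name__ == "__main__":
    pairs = words_with_confidence(SAMPLE_HOCR)
    print(pairs)
    print(keep_confident(pairs))
```

The right threshold depends on your images, so it is probably worth logging the confidences for a batch of known-good and known-bad readings before picking one.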