Python Tesseract can't recognize this font
I have this image:
I want to read it into a string using Python, which I didn't think would be that hard. I came across Tesseract, and then a Python wrapper for it.
I started reading images, and it worked well until I tried this one. Will I have to train it to read this specific font? Any idea what the font is? Or is there a better OCR engine I could use with Python to get this job done?
Edit: Perhaps I could trace the digits as vectors and then redraw them at a larger size? The larger the image, the better Tesseract seems to read it (no surprise, lol).
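The "redraw them larger" idea in the edit can be tried without any vector tracing. The sketch below swaps in plain raster resampling with Pillow instead; the function names, the default factor of 3, and the choice of the Lanczos filter are illustrative assumptions, not something from the question.

```python
def scaled_size(width, height, factor):
    """Return the (width, height) of an image enlarged by `factor`."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    # Never let a dimension collapse to zero pixels.
    return max(1, round(width * factor)), max(1, round(height * factor))


def upscale(src_path, dst_path, factor=3):
    """Save an enlarged copy of the image at src_path to dst_path.

    Assumption: Pillow is installed. It is imported lazily so the size
    helper above can be used without it.
    """
    from PIL import Image

    with Image.open(src_path) as img:
        new_size = scaled_size(img.width, img.height, factor)
        # Lanczos resampling keeps edges smoother than nearest-neighbour.
        bigger = img.resize(new_size, Image.LANCZOS)
        bigger.save(dst_path)
    return dst_path
```

Something like `upscale("digits.png", "digits_3x.png")` (hypothetical file names) would then produce the file to hand to the OCR step. Which factor helps most probably depends on the image, so it is worth trying a few.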
Comments (5)
Just train the engine on the ten digits and '.'; that should be enough. Also make sure you convert the image to grayscale before running OCR on it.
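The grayscale step might look like the sketch below, assuming Pillow is available. The optional hard threshold is an extra that goes beyond what this answer says, and `prepare_for_ocr`, `threshold_table`, and the default cut-off of 128 are hypothetical names and values chosen for illustration.

```python
def threshold_table(threshold=128):
    """Build a 256-entry lookup table mapping gray levels to pure black or white."""
    return [255 if value >= threshold else 0 for value in range(256)]


def prepare_for_ocr(path, threshold=None):
    """Return a grayscale copy of the image at `path`.

    If `threshold` is given, the result is also binarized. That step is an
    assumption added here, not part of the advice above.
    """
    from PIL import Image  # Pillow, imported lazily

    with Image.open(path) as img:
        gray = img.convert("L")  # "L" is Pillow's 8-bit grayscale mode
    if threshold is not None:
        gray = gray.point(threshold_table(threshold))
    return gray
```

The returned image can be passed straight to an OCR wrapper or saved with `.save(...)` first. Whether binarizing helps or hurts is something to check on the actual image.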
Training is hard, and it is not really what is needed here. Distinguishing O from 0 and l from 1 is going to be hard no matter the script. If the context permits it, limiting the OCR to choose only among numerical digits greatly simplifies the problem.
My interest in Tesseract is in processing lots of numbers from old government reports. In that case, and in the case in question, the character set will be something like '0123456789.'. Following a comment by eric_taj on 2007-03-21 in the old (SourceForge) Tesseract newsgroup, you can modify Templates->IndexFor and Templates->ClassIdFor in classify/intproto.cpp to mask off characters that are not allowed. I modified that approach a bit to read the allowed character set from an environment variable at runtime, so that I can adjust the permitted set on the fly.
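A rough Python-side equivalent of this idea is sketched below. It does not reproduce the patch to classify/intproto.cpp; it swaps in Tesseract's `tessedit_char_whitelist` config variable, passed through the pytesseract wrapper. The environment variable name `OCR_ALLOWED_CHARS` and the function names are made up for illustration, and some Tesseract 4.0 builds using the LSTM engine reportedly ignore the whitelist, so results may vary by version.

```python
import os

DEFAULT_CHARS = "0123456789."


def allowed_chars(env_var="OCR_ALLOWED_CHARS"):
    """Return the permitted character set, read from the environment at call time.

    OCR_ALLOWED_CHARS is a hypothetical variable name; an unset or empty
    value falls back to the digits plus '.'.
    """
    return os.environ.get(env_var) or DEFAULT_CHARS


def tesseract_config(chars, psm=7):
    """Build a Tesseract config string restricting output to `chars`.

    --psm 7 tells Tesseract to treat the image as a single line of text.
    `chars` must not contain whitespace, which would split the option.
    """
    return f"--psm {psm} -c tessedit_char_whitelist={chars}"


def read_number(path):
    """OCR the image at `path`, allowing only the configured characters.

    Assumption: Pillow, pytesseract, and the tesseract binary are installed.
    """
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        text = pytesseract.image_to_string(
            img, config=tesseract_config(allowed_chars())
        )
    return text.strip()
```

Because the character set is looked up on every call, running the script as `OCR_ALLOWED_CHARS=0123456789 python script.py` (hypothetical script name) changes the permitted set without touching the code.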
There has been a lot of traffic on this topic in the Tesseract OCR discussion group lately. You will need to use a "language" consisting of just numbers. Many people have trained the engine that way before. It looks like you're trying to outwit a CAPTCHA data protection scheme... tsk, tsk.
That looks like the Eurostile font. Yes, you will have to train for each different font used in your source images.
Recognizing small screen fonts can be hard for a general-purpose OCR engine, which is optimized for reading large, smooth fonts scanned from paper.
You may do better with a specialized screenshot OCR tool such as Textract SDK. It will collect all local fonts and provide 100% precise recognition by simply matching character to character.