Improving preprocessing for OCR/image recognition
I've recently become very interested in image processing and optical character recognition. After some basic recognition work and a few filters, I decided to start on something more difficult.
I'm trying to read the value out of these captchas:
http://img851.imageshack.us/img851/9579/57859946.png
I have written some filters for pre-processing:
- Replace Color (to White): remove the blue lines, and remove the two lines that go through the text
- Threshold image (255)
This outputs images like this:
http://img232.imageshack.us/img232/2325/00i3q45j1zt.png
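For readers who want to reproduce the two filters listed above, here is a minimal sketch in plain Python (the original filters are written in C# and are not shown). The helper names, the "bluish" heuristic with its `margin` parameter, and the reading of `255` as "anything darker than pure white counts as ink" are all assumptions made for illustration.

```python
# Hypothetical helpers: an image is a list of rows of (R, G, B) tuples.
WHITE = (255, 255, 255)

def is_bluish(pixel, margin=40):
    """Assumed heuristic: the blue channel clearly dominates red and green."""
    r, g, b = pixel
    return b > r + margin and b > g + margin

def replace_color(image, predicate, replacement=WHITE):
    """Return a copy in which every pixel matching `predicate` is replaced."""
    return [[replacement if predicate(p) else p for p in row] for row in image]

def threshold(image, level=255):
    """Binarize: 1 (ink) where the mean channel value is below `level`, else 0."""
    return [[1 if sum(p) < 3 * level else 0 for p in row] for row in image]
```

With `level=255` only pure white survives as background, so any anti-aliased edge pixel turns into ink; a lower level would keep fewer grey pixels.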
As you can see, there are holes in some letters. I first thought it might be better to leave the lines through the letters, but that made it worse. I'm using the tesseract OCR engine, and I trained it using the Elephant font (the font the captcha uses). I also tried other OCR engines such as GOCR, but they made everything worse. With tesseract I now have a recognition rate of 20%. I'm coding in C# (.NET 4.0).
The captcha is generated by a software package named PHPCaptcha.
Now my question is:
Is there any algorithm or trick to fill up the holes in the letters? And is there any other way to get better recognition?
I'm excited to hear from you guys
Greetings
Comments (1)
Part 0 - Preface
i) Beforehand, you may want to read my OCR-related answer here, which may give you some tricks for using tesseract.
ii) I assume you can just turn everything into black and white (in your case, processing in color doesn't give you an edge).
Part 1 - Preprocessing
To fill the holes after you've removed the blue lines, you can dilate, or perform a dilate-then-erode operation (morphological closing). Here, dilation means enlarging every pixel in 8 directions (making a bigger pixel). Once you've dilated the pixels, see whether the characters can be recognized, or whether they are 'over-filled' (dilated too much). If the characters cannot be recognized or are dilated too much, you can then apply an erosion operation. Of course there are more advanced synthesis algorithms, but I think you are better off starting with a simpler image processing operation.
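As a rough illustration of dilate-then-erode, here is a minimal sketch in plain Python on a binary image (1 = ink, 0 = background). The function names, the 3x3 (8-neighbour) structuring element, and the zero padding at the border are assumptions chosen for the example; an image library's morphology routines would normally replace this.

```python
def dilate(img):
    """Set a pixel to 1 if it or any of its 8 neighbours is 1."""
    h, w = len(img), len(img[0])
    return [[int(any(img[ny][nx]
                     for ny in range(max(0, y - 1), min(h, y + 2))
                     for nx in range(max(0, x - 1), min(w, x + 2))))
             for x in range(w)] for y in range(h)]

def erode(img):
    """Keep a pixel at 1 only if it and all in-bounds neighbours are 1."""
    h, w = len(img), len(img[0])
    return [[int(all(img[ny][nx]
                     for ny in range(max(0, y - 1), min(h, y + 2))
                     for nx in range(max(0, x - 1), min(w, x + 2))))
             for x in range(w)] for y in range(h)]

def close_holes(img, iterations=1):
    """Dilate `iterations` times, then erode the same number of times."""
    k = iterations
    w = len(img[0])
    # Pad with background so ink near the border is handled like ink elsewhere.
    padded = ([[0] * (w + 2 * k) for _ in range(k)]
              + [[0] * k + list(row) + [0] * k for row in img]
              + [[0] * (w + 2 * k) for _ in range(k)])
    for _ in range(k):
        padded = dilate(padded)
    for _ in range(k):
        padded = erode(padded)
    # Crop the padding off again.
    return [row[k:-k] for row in padded[k:-k]]
```

One pass fills gaps only a pixel or two wide; raising `iterations` fills larger holes but also starts merging neighbouring strokes, which is the 'over-filled' case described above.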
Part 2 - OCR/Tesseract
If you feed the whole image into Tesseract, it performs line analysis and other layout steps. Since characters in a captcha don't behave like normal text, doing line analysis or recognizing them as a group may somewhat degrade the recognition rate. So my suggestion is to recognize character by character first.
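One simple way to get single characters is to cut the binary image at columns that contain no ink. The sketch below is plain Python; the function names and the `min_width` noise filter are assumptions for illustration. It only works when the characters do not touch or overlap, which a captcha may not guarantee. Each crop can then be passed to Tesseract on its own, for example using its single-character page segmentation mode (mode 10).

```python
def split_characters(img, min_width=2):
    """Return (start, end) column ranges that contain ink; `end` is exclusive."""
    width = len(img[0])
    has_ink = [any(row[x] for row in img) for x in range(width)]
    spans, start = [], None
    # The trailing False is a sentinel that closes a run touching the right edge.
    for x, ink in enumerate(has_ink + [False]):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            if x - start >= min_width:  # drop specks narrower than min_width
                spans.append((start, x))
            start = None
    return spans

def crop_columns(img, start, end):
    """Cut the columns start..end-1 out of every row."""
    return [row[start:end] for row in img]
```

A typical use would be `[crop_columns(img, s, e) for s, e in split_characters(img)]`, giving one small image per character in left-to-right order.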