Extracting code from a T-shirt photograph via OCR
I recently saw someone with a T-shirt with some Perl code on the back. I took a photograph of it and cropped out the code:
Next I tried to extract the code from the image via OCR, so I installed Tesseract OCR and the Python bindings for it, pytesser.
Pytesser only works on TIFF images, so I converted the image in GIMP and entered the following code (Ubuntu 9.10):
>>> from pytesser import *
>>> image = Image.open('code.tif')
>>> print image_to_string(image)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pytesser.py", line 30, in image_to_string
    util.image_to_scratch(im, scratch_image_name)
  File "util.py", line 7, in image_to_scratch
    im.save(scratch_image_name, dpi=(200,200))
  File "/usr/lib/python2.6/dist-packages/PIL/Image.py", line 1406, in save
    save_handler(self, fp, filename)
  File "/usr/lib/python2.6/dist-packages/PIL/BmpImagePlugin.py", line 197, in _save
    raise IOError("cannot write mode %s as BMP" % im.mode)
IOError: cannot write mode RGBA as BMP
>>> r,g,b,a = image.split()
>>> img = Image.merge("RGB", (r,g,b))
>>> print image_to_string(img)
Tesseract Open Source OCR Engine
éi _ l_` _ t
’ ‘" fY`
{ W IKQW
· __·_ ‘ ·-»·
:W Z
·· I A n 1
;f
` `
`T .' V _ ‘
I {Z.; » ;,. , ; y i- 4 : %:,,
`· » V; ` ?
‘,—·.
H***li¥v·•·}I§¢ ` _ »¢is5#__·¤G$++}§;“»‘7·
71 ’ Q { NH IQ
ytéggygi { ;g¤qg;gm·;,g(g,,3) {3;;+-
§ {Jf**$d$ }‘$p•¢L#d¤ Sc}
» i ` i A1:
The OCR engine's output is clearly gibberish. So, my questions are:
- What do I have to do to get better OCR results out of Tesseract?
- Or, does anybody else have better luck extracting the code from the above image in another way?
Comments (7)
You can probably type faster than you can clean up images and install OCR engines:
Edit: typo.
Pre-processing will definitely yield a more workable image.
For example, here is the result of applying the GIMP "Levels", "Difference-of-Gaussians", and "Levels" filters to the image.
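For anyone who would rather script that pipeline than click through it, here is a rough, dependency-free sketch of what the two filters do, written for Python 3 (the session in the question is Python 2). It assumes the image has already been loaded as a list of rows of 0-255 grayscale values. The function names and the sigma defaults are my own illustrative choices, not GIMP's parameters.

```python
import math


def levels(pixels, black, white):
    """Linearly stretch values so `black` maps to 0 and `white` to 255."""
    span = max(white - black, 1)
    return [[min(255, max(0, round((p - black) * 255 / span))) for p in row]
            for row in pixels]


def gaussian_kernel(sigma):
    """Normalized 1-D Gaussian weights, truncated at about 3 sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(pixels, kernel):
    """Convolve every row with the kernel, repeating the edge pixels."""
    radius = len(kernel) // 2
    blurred = []
    for row in pixels:
        n = len(row)
        blurred.append([
            sum(k * row[min(n - 1, max(0, x + i - radius))]
                for i, k in enumerate(kernel))
            for x in range(n)
        ])
    return blurred


def transpose(pixels):
    return [list(column) for column in zip(*pixels)]


def gaussian_blur(pixels, sigma):
    """Separable blur: one pass along rows, one pass along columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(pixels, kernel)), kernel))


def difference_of_gaussians(pixels, sigma_small=1.0, sigma_large=2.0):
    """Subtract a wide blur from a narrow one; flat areas cancel to about 0."""
    narrow = gaussian_blur(pixels, sigma_small)
    wide = gaussian_blur(pixels, sigma_large)
    return [[a - b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(narrow, wide)]
```

The difference image is signed and low in contrast, which is presumably why a second "Levels" pass follows: calling `levels` again with the minimum and maximum of the difference image stretches it back to the full 0-255 range.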
Just a few small typos in RedDwight's code.
When executed, it produces:
If I were you, I'd start by cleaning up the image as much as possible with a picture-manipulation program (GIMP, for example), so that the OCR engine gets input it can interpret more easily.
If possible, aim for a purely black-and-white image.
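One way to get a purely black-and-white image without picking a cutoff by eye is Otsu's method, which chooses the threshold that best separates dark pixels from light ones. The sketch below is a minimal Python 3 version. It assumes the image is a flat list of integer grayscale values from 0 to 255, and the function names are illustrative.

```python
def otsu_threshold(pixels):
    """Return the 0-255 cut that maximizes the between-class variance."""
    histogram = [0] * 256
    for p in pixels:
        histogram[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark = 0   # number of pixels at or below the candidate cut
    sum_dark = 0      # sum of their gray values
    for value, count in enumerate(histogram):
        weight_dark += count
        sum_dark += value * count
        weight_light = total - weight_dark
        if weight_dark == 0 or weight_light == 0:
            continue  # one class is empty, so nothing is separated here
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_threshold, best_variance = value, variance
    return best_threshold


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]
```

A single global threshold like this tends to struggle with uneven lighting, which a photo of a T-shirt is likely to have, so evening out the background first may still be needed.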
Hmm, perhaps you need to process the image, i.e. put it through some filters like 'edge detection', emboss/engrave or a noise filter...
Good OCR engines are strongly guided by the redundancy in natural languages, which narrows down "what might the next character be". Perl code gives no such aid to the OCR. Type it in by hand.
The key for a task like this is to take advantage of the evident constraints. Find a library that lets you specify your own character set. Require all the characters in the main DNA helices to be one of A T G C. Require that the whole thing parse as Perl. Type in the hard parts by hand if necessary.
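A hedged sketch of how those constraints could be wired up in Python 3. It assumes a Tesseract build whose command line accepts `-c VAR=VALUE` and honors the `tessedit_char_whitelist` variable, which depends on the version and engine mode. It also assumes `perl` is on the PATH. The helper names and file names are hypothetical.

```python
import subprocess


def tesseract_command(image_path, output_base, whitelist=None):
    """Build (but do not run) a tesseract call restricted to `whitelist`."""
    command = ["tesseract", image_path, output_base]
    if whitelist:
        # Assumption: this build supports tessedit_char_whitelist via -c.
        command += ["-c", "tessedit_char_whitelist=" + whitelist]
    return command


def unexpected_characters(text, allowed):
    """Return the sorted characters in `text` that are outside `allowed`."""
    return sorted(set(text) - set(allowed))


def parses_as_perl(path):
    """Check syntax with `perl -c`, which compiles without running the body.

    Note that BEGIN blocks still execute, so only use this on code you trust.
    """
    result = subprocess.run(["perl", "-c", path],
                            capture_output=True, text=True)
    return result.returncode == 0
```

For the helix region this would be something like `subprocess.run(tesseract_command("helix.tif", "helix", "ATGC"))`. After that, `unexpected_characters` flags anything that slipped through, and `parses_as_perl` acts as a final check on the assembled file, with any rejected parts retyped by hand.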