What is the typical method for separating connected letters in a word using OCR?
I am very new to OCR and know almost nothing about the algorithms used to recognize words. I am only just getting familiar with the field.
Could anybody please advise on the typical method used to recognize and separate individual characters in connected form (I mean a word in which all the letters are linked together)? Forget about handwriting: supposing the letters are joined together in a known font, what is the best method to identify each individual character in a word? When characters are written separately there is no problem, but when they are joined together we need to know where every single character starts and ends before we can go to the next step and match each one to a letter.
Is there any known algorithm for that?
Comments (1)
The standard term for this process is "character segmentation". Segmentation is the image processing term for breaking an image into grouped regions for recognition. Searching Google Scholar for "Arabic character segmentation" turns up a lot of hits if you want to learn more.
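To make "segmentation" a bit more concrete, here is a minimal sketch of one simple heuristic, a vertical projection profile: count the ink in each column and propose cuts where the count drops to the thickness of a thin connecting stroke. This is only an illustration under assumptions of my own (a binarized image given as a list of rows of 0/1, and a hypothetical `max_ink` threshold); it is not taken from any particular OCR engine, and real connected scripts need considerably more care.

```python
def column_ink(image):
    """Count foreground pixels (1s) in each column of a binary image."""
    return [sum(column) for column in zip(*image)]


def candidate_cuts(image, max_ink=1):
    """Return one candidate cut column per thin run between inked columns.

    A "thin run" is a maximal run of columns whose ink count is <= max_ink.
    Runs touching the left or right border are treated as margins and ignored.
    max_ink is a hypothetical threshold: roughly the thickness of the
    connecting stroke, in pixels.
    """
    profile = column_ink(image)
    cuts = []
    run_start = None
    for x, ink in enumerate(profile):
        if ink <= max_ink:
            if run_start is None:
                run_start = x
        elif run_start is not None:
            if run_start > 0:  # skip a run that starts at the left margin
                cuts.append((run_start + x - 1) // 2)  # middle of the run
            run_start = None
    return cuts


# Two box-shaped "letters" joined by a one-pixel baseline stroke.
word = [
    [1, 1, 1, 0, 0, 0, 1, 1, 1],
    [1, 0, 1, 0, 0, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]
print(candidate_cuts(word))
```

A profile like this over-segments wide letters and misses ligatures, which is one reason the approaches below work on blobs or whole words instead of committing to letter boundaries up front.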
I'd encourage you to look at Tesseract, an open source OCR implementation, and especially at its documentation.
The glossary entry for "feature" touches on this, and there is a ton of further information here.
Basically, Tesseract approaches the problem (according to How Tesseract Works) by looking at blobs rather than letters, and then combining those blobs into words. This avoids the problem you describe, while creating new ones.
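As a rough idea of what "looking at blobs" can mean, the sketch below labels connected regions of ink and returns their bounding boxes. It is my own simplified illustration, not Tesseract's code or API: the function name `find_blobs`, the 0/1 list-of-rows image format, and the choice of 4-connectivity are all assumptions.

```python
from collections import deque


def find_blobs(image):
    """Return bounding boxes (left, top, right, bottom) of 4-connected
    foreground regions in a binary image given as a list of rows of 0/1."""
    height = len(image)
    width = len(image[0]) if image else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            seen[y][x] = True
            queue = deque([(x, y)])
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes


page = [
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
print(find_blobs(page))
```

Note that a fully connected word comes out as a single blob, which is exactly why a blob-based pipeline has to decide later whether a blob is one character, part of one, or several joined together.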
For Arabic (as you point out), Tesseract doesn't work. I don't know much about this area, but this paper seems to imply that Dynamic Time Warping (DTW) is a useful technique. It tries to stretch a word to match it against known words, and so again works in word space rather than letter space.
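For what it's worth, the core of DTW is a small dynamic program. The sketch below is the textbook version on 1-D sequences, not the method from the paper; I am assuming each word image has already been reduced to a sequence of numbers (for example per-column ink counts), and the template names and values are made up for illustration.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences,
    using absolute difference as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # step both
    return cost[n][m]


def closest_word(query, templates):
    """Return the name of the template with the smallest DTW distance."""
    return min(templates, key=lambda name: dtw_distance(query, templates[name]))


# Hypothetical column-ink profiles for two known words.
templates = {
    "boxes": [4, 2, 4, 1, 1, 1, 4, 2, 4],
    "line": [1, 1, 1, 1],
}
# The same shape as "boxes", but rendered wider (columns repeated).
query = [4, 4, 2, 2, 4, 4, 1, 1, 1, 1, 4, 2, 4]
print(closest_word(query, templates))
```

Because DTW lets one sequence stretch against the other, the wider rendering still matches its template with zero cost, without ever deciding where one letter ends and the next begins.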