OCR: the most "different" or "recognizable" ASCII characters?
I am looking for a way to determine the N most "different" or "recognizable" ASCII characters. For example, if N = 10, which N characters in the ASCII range 0x21 to 0x7E are the most different from one another? Obviously, the character "X" is very different from "O" (the letter), but "O" (the letter) is very similar to "0" (zero). Assume a restricted OCR character subset, so that zero and the letter O are always detected as one or the other and nobody has to care which it was. Under that restriction, which N characters would typical OCR engines (for example Tesseract) recognize most easily from a poor-quality input image? Similar assumptions can be made for other pairs: "+" and "t" could easily be mistaken for one another, so each input character, whether it's "+" or "t", would correspond to only one of the two.
Thanks,
Ben
Comments (2)
Unfortunately I don't think there will be a single unique answer for this.
It'll depend on the font: compare the different ways that 0, f, and s are drawn, along with any stylistic flourishes.
It'll depend on the type of damage the characters receive before being scanned: some may be more resilient against smudging, others against cuts, others against over-writing.
If you're looking for a representation that's best at surviving being printed, scanned and OCRed, then maybe a 1D or 2D barcode would be a better choice?
Only one way to answer this question: test it. Create a set of samples for each letter, and run OCR on each sample. The letters that OCR gets right the most often are the most "recognizable"; the letters that OCR gets wrong most often are the most "different".
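As a rough sketch of that test, the following Python tallies how each sample was read and then greedily picks N characters that were rarely confused with one another. Everything here is an assumption made for illustration: `recognize` is a hypothetical callable that you would write yourself to wrap an OCR engine such as Tesseract, `samples` is a list of `(true_char, image)` pairs you would generate from your own font and damage model, and the greedy cost function is just one plausible heuristic, not an established method.

```python
from collections import Counter


def confusion_counts(samples, recognize):
    """Count how often each true character was read as each output.

    samples   -- iterable of (true_char, image) pairs (assumed input format)
    recognize -- hypothetical callable: image -> recognized character
    """
    counts = Counter()
    for true_char, image in samples:
        counts[(true_char, recognize(image))] += 1
    return counts


def select_chars(counts, candidates, n):
    """Greedily choose up to n candidates with little mutual confusion."""
    totals = Counter()
    correct = Counter()
    for (true_char, read), k in counts.items():
        totals[true_char] += k
        if read == true_char:
            correct[true_char] += k

    def error_rate(c):
        # Characters with no samples are treated as always wrong.
        return 1.0 - correct[c] / totals[c] if totals[c] else 1.0

    def cost(c, chosen):
        # First avoid clashes with characters already chosen,
        # then prefer characters the OCR usually reads correctly.
        clash = sum(counts[(c, s)] + counts[(s, c)] for s in chosen)
        return (clash, error_rate(c), c)

    chosen = []
    remaining = list(candidates)
    while remaining and len(chosen) < n:
        best = min(remaining, key=lambda c: cost(c, chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

To use it, render each candidate character many times in the fonts you care about, degrade the images the way you expect real input to be degraded, and pass the results through `confusion_counts` before calling `select_chars(counts, candidates, N)`. Because the outcome depends on font and damage type, as the other answer points out, the selected set is only meaningful for the conditions you actually sampled.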