Generating a font from a text image
Is it possible to generate a specific font set from the image of text given below? My idea is to generate a specific font for that image by manually selecting portions of the image and mapping them to a set of letters, generating a font from those mappings, and then using that font to make the text readable for an OCR engine. Is font generation possible with any open-source implementation? Please also suggest any good OCR engines.
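For the font-generation part, one open-source option is FontForge, which can be scripted from Python. Below is a minimal sketch of the idea, assuming each selected letter has already been cropped from the scan and traced to an SVG outline (for example with potrace); the font name, glyph list, and file paths are all placeholders, not something the question provides:

```python
# Sketch only: assumes each letter has already been cropped from the scan
# and traced to an SVG outline (e.g. with potrace). Must be run with the
# Python bundled with FontForge so the "fontforge" module is available.
import fontforge

font = fontforge.font()
font.fontname = "ScannedHand"          # hypothetical name
font.familyname = "Scanned Hand"
font.fullname = "Scanned Hand Regular"

# Map each manually selected image region to a letter.
# glyph_sources is an assumption: letter -> traced outline file.
glyph_sources = {"a": "glyphs/a.svg", "b": "glyphs/b.svg"}  # ...and so on

for letter, svg_path in glyph_sources.items():
    glyph = font.createChar(ord(letter))
    glyph.importOutlines(svg_path)     # import the traced outline
    glyph.width = 600                  # placeholder advance width; tune by hand

font.generate("scanned_hand.ttf")      # write out a TrueType font
```

FontForge also has an AutoTrace integration for bitmap imports, but it relies on an external autotrace/potrace binary; tracing outside the tool and importing SVGs tends to be more predictable.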
1 Answer
Abbyy FineReader 10 gets better-than-expected results but, predictably, gets confused when the characters touch.
Your problem is that the line spacing is too small. The descenders of each line overlap the character bounding boxes of the characters in the line directly below. This makes character segmentation almost impossible because the characters are touching and overlapping. The number of combinations of overlapping characters is virtually impossible to train for. The 'g' and 'y' characters are the worst offenders.
A double line spaced version of this would probably OCR reasonably well.
A custom solution that segments and separates each line, along with a good dictionary, would definitely improve the results. There would still be some errors to correct manually, though. The custom routine would have to deal with the ascenders and descenders and try to segment the image into lines, which could then be fed to a decent OCR engine. One way would be to analyse every character blob on the page and allocate it to a line. Leptonica (www.leptonica.com - C imaging library) would probably make this job a little easier.
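A rough sketch of that blob-to-line allocation, using OpenCV/NumPy rather than Leptonica; the file names and the line-break heuristic are assumptions to tune, and a page with heavily overlapping descenders will need more care than this:

```python
# Sketch: group connected components into text lines by their vertical centres,
# then crop each line so it can be fed to an OCR engine separately.
import cv2
import numpy as np

img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)   # hypothetical input
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

num, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)

# Sort blobs top-to-bottom by centroid y, skipping the background label 0
# (assumes the page actually contains some foreground blobs).
blobs = sorted(range(1, num), key=lambda i: centroids[i][1])

lines, current = [], [blobs[0]]
for prev, cur in zip(blobs, blobs[1:]):
    # Start a new line when the vertical gap between centroids exceeds the
    # previous blob's height (a crude heuristic, not a tested threshold).
    if centroids[cur][1] - centroids[prev][1] > stats[prev, cv2.CC_STAT_HEIGHT]:
        lines.append(current)
        current = []
    current.append(cur)
lines.append(current)

for n, line in enumerate(lines):
    x0 = min(stats[i, cv2.CC_STAT_LEFT] for i in line)
    y0 = min(stats[i, cv2.CC_STAT_TOP] for i in line)
    x1 = max(stats[i, cv2.CC_STAT_LEFT] + stats[i, cv2.CC_STAT_WIDTH] for i in line)
    y1 = max(stats[i, cv2.CC_STAT_TOP] + stats[i, cv2.CC_STAT_HEIGHT] for i in line)
    cv2.imwrite(f"line_{n:03d}.png", img[y0:y1, x0:x1])
```

A descender that physically touches a character on the line below will end up as a single blob, so some blobs may still need to be split or assigned by their top edge rather than their centroid.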
I would not try this without increasing the resolution to 200 or 300 dpi first.
With this custom solution, training a font becomes an option if the OCR engine does a poor job initially.
Abbyy (www.abbyy.com) or Google Tesseract OCR 3.00 would be a good place to start.
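As a follow-on to the line-cropping sketch above, feeding each cropped line to Tesseract could look roughly like this; pytesseract and the `--psm 7` single-line mode are my assumptions, not something the answer specifies:

```python
# Sketch: OCR each cropped line image with Tesseract via pytesseract.
import glob
import pytesseract
from PIL import Image

text_lines = []
for path in sorted(glob.glob("line_*.png")):
    # --psm 7 tells Tesseract to treat the image as a single text line.
    text_lines.append(pytesseract.image_to_string(Image.open(path), config="--psm 7"))

print("\n".join(text_lines))
```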
No guarantees as to whether all of this will work, though. This is quite a difficult page to OCR, and you need to work out whether it would be better to have it typed up manually overseas. It depends on the number of pages you need to process.