How to make tesseract give relevant results in the presence of noise?
I am using tesseract 3.0.0 and ran into the following problem:
When there is something too small for tesseract to recognize, it seems to get merged with other fragments. As a result, nothing relevant is returned.
The image below shows 3 cases. Only the rectangle with the dashed line is passed to tesseract. The result is shown above each rectangle (V over T means a new line).
The last case is the problematic one. Is there some way to improve tesseract's results in situations like this?
Comments (1)
As far as I know, Tesseract does not have proper image segmentation yet (or Document Analysis, as it is called in commercial OCR applications). Typically, before OCR is done, the image is split into separate areas that contain text, pictures, barcodes, lines and so on. Then you apply OCR only to the text areas and don't face the problems you have just described.
Earlier versions of Tesseract did not have that functionality at all, and Tesseract was supposed to be used only as a line recognizer, or so-called field-level recognizer, applied to small snippets of text cut from a bigger image.
I did not follow closely what was introduced in 3.0. It is probably partially there already, but obviously it does not work as expected, as you have just found out.
There is another open-source project, OCRopus, that approached this problem exactly as I described: first Document Analysis (aka Segmentation) and only then OCR. Their earlier versions actually used Tesseract for OCR after the analysis step finished. But later they introduced their own OCR (which is still not very good) and moved Tesseract plugin support down the priority list.
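To make the "separate the text from everything else first" idea concrete, here is a minimal sketch of one very simple pre-filter: erase connected blobs that are too small to be characters before handing the snippet to the OCR engine. It is not what Tesseract or any commercial engine does internally. It assumes the image has already been binarized into a grid of 0 (background) and 1 (ink), and both the function name `remove_small_components` and the `min_area` threshold are hypothetical and would need tuning per image.

```python
from collections import deque


def remove_small_components(grid, min_area):
    """Return a copy of a binary grid (lists of 0/1) in which every
    8-connected blob of 1s with fewer than min_area pixels is erased."""
    rows = len(grid)
    cols = len(grid[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    out = [row[:] for row in grid]

    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects every pixel of this blob.
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Assumption: blobs below the threshold are noise, not text.
            if len(component) < min_area:
                for y, x in component:
                    out[y][x] = 0
    return out
```

Keep in mind that a threshold set too high will also erase legitimate small marks such as dots on "i" or punctuation, so this only helps when the noise is clearly smaller than anything you want to read.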
Here's what you can actually do to address your problem:
Disclaimer: I work for ABBYY