How can I make tesseract give relevant results in the presence of noise?

Posted on 2024-10-16 07:56:44

I am using tesseract 3.0.0 and I bumped into the following problem:

When something is too small for tesseract to recognize, it seems to get merged with other fragments. As a result, nothing relevant is returned.

The image below shows 3 cases. Only the rectangle with the dashed line is passed to tesseract. Over the rectangle is the result (V over T means new line).

The last case is the problematic one. Is there some way to improve tesseract's results in situations like this?

[Image: three cases, each showing a dashed rectangle passed to tesseract with the recognition result above it]

Comments (1)

水溶 2024-10-23 07:56:44

As far as I know, Tesseract does not have proper image segmentation yet (or Document Analysis, as it is called in commercial OCR applications). Typically, before OCR is done, the image is split into separate areas that contain text, pictures, barcodes, lines and so on. Then you apply OCR only to the text areas and don't face the problems you have just described.

Earlier versions of Tesseract did not have that functionality at all, and Tesseract was supposed to be used only as a line recognizer, or so-called field-level recognizer, applied to small snippets of text cut from a bigger image.

I did not thoroughly follow what was introduced in 3.0. It is probably partially there already, but obviously it does not work as expected, as you have just found out.
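For snippets that are already cropped to one line of text, one thing worth trying is telling Tesseract to skip its own layout analysis. The sketch below is only an illustration: it assumes the `tesseract` binary is on the PATH, that the 3.x command line accepts `-psm 7` (treat the image as a single text line; newer releases spell the option `--psm`), and the file names are hypothetical.

```python
import subprocess


def build_tesseract_command(image_path, output_base, psm=7):
    """Build the argument list for a single-line recognition run.

    Assumption: Tesseract 3.x spells the option "-psm"; 4.x and later
    use "--psm". Mode 7 means "treat the image as a single text line".
    """
    return ["tesseract", image_path, output_base, "-psm", str(psm)]


def recognize_line(image_path, output_base):
    """Run tesseract on a pre-cropped snippet and return the recognized text."""
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    # Tesseract writes its result to "<output_base>.txt".
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read().strip()


# Hypothetical file names, shown only to illustrate the call shape.
command = build_tesseract_command("snippet.png", "snippet_out")
```

Whether this helps with the third case in the question depends on how the noise sits relative to the text, so treat it as an experiment and not as a fix.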

There is another open-source project, OCRopus, that approached this problem exactly as I described: first Document Analysis (aka Segmentation) and only then OCR. Their earlier versions actually used Tesseract for OCR after the analysis step finished. But later they introduced their own OCR (which is still not very good) and moved Tesseract plugin support down their priority list.

Here's what you can actually do to address your problem:

  • If your images have a very typical structure, you can try to do some dumb segmentation and cut the text from the image yourself before passing it to Tesseract. However, if you expect to support a wide variety of images, just forget it.
  • You can check OCRopus and see if its segmentation works for your images. If so, you can spend some time making OCRopus + Tesseract work together.
  • Well, if what you do is not just for fun and you value your time, I would recommend thinking about a real OCR engine like ABBYY. You will get much higher accuracy in both segmentation and OCR out of the box, and professional customer support of course.

Disclaimer: I work for ABBYY
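To make the "dumb segmentation" option concrete, here is a minimal sketch of one possible pre-cleaning step: erase connected blobs that are too small to be characters before the snippet goes to Tesseract. It assumes the image has already been binarized into rows of 0 (background) and 1 (ink). The function name and the `min_area` threshold are illustrative choices, not part of Tesseract or of any library.

```python
from collections import deque


def remove_small_components(grid, min_area):
    """Return a copy of a binary image with small foreground blobs erased.

    grid: list of rows, each a list of 0 (background) or 1 (ink).
    min_area: blobs with fewer pixels than this are treated as noise.
    """
    height = len(grid)
    width = len(grid[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    cleaned = [row[:] for row in grid]

    for y in range(height):
        for x in range(width):
            if grid[y][x] != 1 or seen[y][x]:
                continue
            # Flood-fill one 8-connected blob starting at (y, x).
            blob = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and grid[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Erase the blob in the copy if it is too small to be a glyph.
            if len(blob) < min_area:
                for by, bx in blob:
                    cleaned[by][bx] = 0
    return cleaned


# A 5-pixel "T" shape plus one isolated speck at row 1, column 5.
image = [
    [1, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0],
]
cleaned = remove_small_components(image, min_area=3)
```

The threshold has to be tuned per image set: set it too high and dots on "i" or punctuation disappear along with the noise, which is part of why this approach only suits images with a very regular structure.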
