OCR: How to improve accuracy - existing libraries for removing non-text "furniture", shapes, etc. to avoid confusing OCR?

Posted on 2024-08-24 19:29:59

I want to remove rectangles etc that enclose text in a screenshot image, so that I can perform optical character recognition to get accurate text from the screenshot.

Background:

I am doing this to extract data from a legacy application for use with other applications. This is the only way to get at the data, as the associated files are in a closed, proprietary, binary format.

I will be using AutoItScript to drive the application to show data in its UI, then I will screenshot this and feed this to tesseract.

I've already had some success in automating the UI, and have been able to use tesseract to get plain ASCII text out of the bitmap.

There are several AutoItScript forum articles discussing its use with tesseract/OCR, but none specifically address my question.
http://www.autoitscript.com/forum/index.php?s=6c32c3ece12756e635a619cdf175eff9&showforum=2

What I need to do

There are thin, 1-pixel-wide rectangles that closely enclose some of the text. When the image is fed to tesseract, it reads parts of them as characters, for example a vertical line of a rectangle as the letter "I".

Any thoughts on how to remove the rectangles, or best practices?

I'm asking if there is a generic command line based toolset to overwrite rectangles, for example, in .png files. I could then pass the .png through this, then pass it to tesseract.

Details on the tesseract release/setup I've used are as follows:

Go here: http://code.google.com/p/tesseract-ocr/downloads/list
- For the basic English generic character set to get Tesseract up and running and recognising your bitmapped text into ASCII text, use tesseract-2.00.eng.tar.gz (current version at time of writing is: "English language data for Tesseract (2.00 and up) Jul 2007 989 KB 84845")

Related questions I have already looked at on Stack Overflow

In these, my question is not completely answered or a commercial solution is being sold. I do not want to consider a commercial solution at this stage.

Comments (1)

若水微香 2024-08-31 19:29:59

There's probably not going to be a free, off-the-shelf solution for this, but coding your own shouldn't be too hard, since it's probably safe to assume that a rectangle will never be a valid character in your font's alphabet and can therefore be removed safely. It also helps that all your rectangle borders are exactly one pixel wide.

So search for a contiguous horizontal line that is joined to another, parallel line of the same length by exactly two vertical lines. Repeat the search until you have found all the rectangles in the image, then render them all transparent with Graphics.DrawRectangle and Pens.Transparent (with the default CompositingMode.SourceOver a transparent pen leaves the pixels unchanged, so either set Graphics.CompositingMode to SourceCopy first or draw in the background colour instead). Don't erase any rectangle until you've finished searching, otherwise you risk wiping out parts of overlapping rectangles before you've found them. This is just a starter suggestion; I haven't implemented or debugged this algorithm.
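
The search described above can be sketched roughly as follows. This is a minimal illustration in Python, not a tested tool: it assumes the screenshot has already been thresholded into a nested list of 0/1 values (1 = dark pixel), and the names find_rectangles, erase_rectangles, min_w and min_h are made up for this example. Loading and saving the .png is left to whatever imaging library you prefer. Erased borders are set to the background value 0 rather than made transparent.

```python
def find_rectangles(img, min_w=4, min_h=4):
    """Return (top, left, bottom, right) for each thin rectangle outline.

    img is a list of rows of 0/1 values, 1 meaning a dark pixel.
    min_w / min_h are hypothetical size limits so that short glyph
    strokes are not mistaken for rectangle edges.
    """
    h = len(img)
    w = len(img[0]) if h else 0
    # right[y][x]: length of the dark run starting at (y, x) going right.
    # down[y][x]:  length of the dark run starting at (y, x) going down.
    right = [[0] * (w + 1) for _ in range(h + 1)]
    down = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h - 1, -1, -1):
        for x in range(w - 1, -1, -1):
            if img[y][x]:
                right[y][x] = right[y][x + 1] + 1
                down[y][x] = down[y + 1][x] + 1

    rects = []
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            if x > 0 and img[y][x - 1]:
                continue  # not the left end of a horizontal line
            rw, rh = right[y][x], down[y][x]
            if rw < min_w or rh < min_h:
                continue
            x2, y2 = x + rw - 1, y + rh - 1
            closes = down[y][x2] >= rh and right[y2][x] >= rw
            thin = not img[y + 1][x + 1]  # border is one pixel wide
            if closes and thin:
                rects.append((y, x, y2, x2))
    return rects


def erase_rectangles(img, rects):
    """Return a copy of img with every listed outline set to background."""
    out = [row[:] for row in img]
    for top, left, bottom, right_col in rects:
        for x in range(left, right_col + 1):
            out[top][x] = 0
            out[bottom][x] = 0
        for y in range(top, bottom + 1):
            out[y][left] = 0
            out[y][right_col] = 0
    return out
```

Calling erase_rectangles(img, find_rectangles(img)) collects every outline before anything is erased, which follows the advice above about overlapping rectangles. The size limits would need tuning against real screenshots, and text that touches a border or rectangles that share an edge are not handled by this sketch.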
