Where do I start with text pattern recognition - Java based

Posted on 2024-11-14 16:36:42

I am seriously considering writing an Optical Character Recognition program. I am well versed in Java and would love to know about the libraries available out there. Basically, I want to convert something like the following to text. I will need manual intervention to specify a pattern. For example, I would ask the user to mark an f in this text, so that I know where f occurs.

[image not available]

I am entirely new to this, so I don't mind learning from scratch as well. I need guidance.

Comments (2)

落花随流水 2024-11-21 16:36:42

If you are thinking of coding an OCR program from scratch, reading up on the techniques may be useful. I found an OCR survey from 1996 that reviews some of the techniques that were popular a decade and a half ago. Reading it might be helpful; then track down the papers it cites and the papers that cite it.

Usually the process goes as follows:

1. Find the text.
2. Find the characters in the text.
3. Extract features from the characters found.
4. Do pattern matching.
5. Report the suspected character.
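
The steps above can be sketched on a toy example. This is only a minimal illustration under loud assumptions, not how a real OCR engine works: the "image" is an array of strings in which '#' is ink, characters are assumed to be separated by at least one empty column, the features are simple 2x2 zoning densities, and matching is nearest neighbour against a few labelled glyphs (the kind of examples a user could mark by hand). The class name `OcrPipelineSketch` and all method names are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class OcrPipelineSketch {

    /** Steps 1-2: cut the image into glyphs wherever a column contains no ink ('#'). */
    static List<String[]> segment(String[] image) {
        List<String[]> glyphs = new ArrayList<>();
        int width = image[0].length();
        int start = -1; // first column of the glyph being scanned, or -1 between glyphs
        for (int x = 0; x <= width; x++) {
            boolean ink = false;
            if (x < width) {
                for (String row : image) {
                    if (row.charAt(x) == '#') { ink = true; break; }
                }
            }
            if (ink && start < 0) {
                start = x;                       // a new glyph begins here
            } else if (!ink && start >= 0) {
                String[] glyph = new String[image.length];
                for (int y = 0; y < image.length; y++) {
                    glyph[y] = image[y].substring(start, x);
                }
                glyphs.add(glyph);               // the glyph ended at column x - 1
                start = -1;
            }
        }
        return glyphs;
    }

    /** Step 3: zoning features, i.e. the share of glyph cells that are ink in each zone of a 2x2 grid. */
    static double[] features(String[] glyph) {
        int h = glyph.length;
        int w = glyph[0].length();
        double[] f = new double[4];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (glyph[y].charAt(x) == '#') {
                    f[(y * 2 / h) * 2 + (x * 2 / w)] += 1.0 / (w * h);
                }
            }
        }
        return f;
    }

    /** Step 4: nearest-neighbour matching by squared Euclidean distance between feature vectors. */
    static char classify(double[] f, Map<Character, double[]> templates) {
        char best = '?';
        double bestDist = Double.MAX_VALUE;
        for (Map.Entry<Character, double[]> e : templates.entrySet()) {
            double d = 0;
            for (int i = 0; i < f.length; i++) {
                double diff = f[i] - e.getValue()[i];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = e.getKey(); }
        }
        return best;
    }

    /** Step 5: report the best guess for every glyph, left to right. */
    static String recognize(String[] image, Map<Character, double[]> templates) {
        StringBuilder sb = new StringBuilder();
        for (String[] glyph : segment(image)) {
            sb.append(classify(features(glyph), templates));
        }
        return sb.toString();
    }

    /** Hypothetical labelled examples; in the asker's scenario the user would mark these. */
    static Map<Character, double[]> demoTemplates() {
        Map<Character, double[]> t = new LinkedHashMap<>();
        t.put('L', features(new String[] {"#..", "#..", "#..", "#..", "###"}));
        t.put('I', features(new String[] {"#", "#", "#", "#", "#"}));
        t.put('T', features(new String[] {"###", ".#.", ".#.", ".#.", ".#."}));
        return t;
    }

    public static void main(String[] args) {
        String[] image = {
            "#...#.###",
            "#...#..#.",
            "#...#..#.",
            "#...#..#.",
            "###.#..#."
        };
        System.out.println(recognize(image, demoTemplates())); // prints LIT
    }
}
```

Real scans need much more at every step (binarization, skew correction, touching or broken characters, far richer features), which is where the survey literature comes in.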

While getting a user to annotate text is fun and exciting, finding a collection of handwriting that is already annotated might save you a lot of time. That way you can focus on the nuts and bolts of doing OCR rather than on building your own database of annotated text.

To start with a slightly easier task, you might want to consider building a system to detect handwritten digits. The USPS produced a corpus for developing systems that do this for zip code processing. The link was something I found with a quick search.

我要还你自由 2024-11-21 16:36:42

If you want to use/look at a library, you could try the Google-endorsed Tesseract.
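
One simple way to try Tesseract from Java, without committing to a binding library, is to run its command-line tool as a subprocess. The sketch below assumes the `tesseract` executable is installed and on the PATH, and that the `eng` language data is available; the class name `TesseractCliSketch` and its methods are made up for illustration. It relies only on the documented command form `tesseract <image> stdout -l <lang>`, which writes the recognized text to standard output.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class TesseractCliSketch {

    /** Builds the command line: tesseract <image> stdout -l <language>. */
    static List<String> buildCommand(String imagePath, String language) {
        return List.of("tesseract", imagePath, "stdout", "-l", language);
    }

    /** Runs the tesseract CLI (assumed to be on the PATH) and returns the recognized text. */
    static String recognize(String imagePath, String language)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(imagePath, language))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // drop progress messages on stderr
                .start();
        String text = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.out.println("usage: java TesseractCliSketch <image-file>");
            return;
        }
        System.out.print(recognize(args[0], "eng"));
    }
}
```

Shelling out keeps the example free of third-party dependencies; a Java wrapper around the Tesseract API would avoid the process overhead if OCR becomes a core part of the program.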
