OCR error-correction algorithms

Posted 2024-10-31 19:47:03

I'm working on digitizing a large collection of scanned documents, working with Tesseract 3 as my OCR engine. The quality of its output is mediocre, as it often produces both garbage characters before and after the actual text, and misspellings within the text.

For the former problem, it seems like there must be strategies for determining which text is actually text and which text isn't (much of this text is things like people's names, so I'm looking for solutions other than looking up words in a dictionary).
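One dictionary-free way to approach the leading/trailing garbage is to score each token by how letter-like it is and trim non-text tokens from both ends of a line. The sketch below is only an illustration: the function names and the 0.7 threshold are assumptions, not something taken from Tesseract or any published method.

```python
def looks_like_text(token: str, min_alpha_ratio: float = 0.7) -> bool:
    """Dictionary-free guess: a token counts as text if it is mostly letters."""
    if len(token) < 2:
        return False
    letters = sum(ch.isalpha() for ch in token)
    return letters / len(token) >= min_alpha_ratio


def trim_edge_garbage(line: str) -> str:
    """Drop tokens from both ends of a line until a text-like token is reached."""
    tokens = line.split()
    start, end = 0, len(tokens)
    while start < end and not looks_like_text(tokens[start]):
        start += 1
    while end > start and not looks_like_text(tokens[end - 1]):
        end -= 1
    # Tokens between the first and last text-like token are kept untouched.
    return " ".join(tokens[start:end])


print(trim_edge_garbage("~|: ,.' John Smithson ;_ #%"))
```

Because it never consults a word list, this keeps unusual surnames, but it will also trim legitimate short tokens such as initials ("J.") at the edges, so the threshold would need tuning on real output.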

For the typo problem, most of the errors stem from a few misclassifications of letters (substituting l, 1, and I for one another, for instance), and it seems like there should be methods for guessing which words are misspelled (since not too many words in English have a "1" in the middle of them), and guessing what the appropriate correction is.
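The l/1/I idea can be sketched as two steps: flag tokens that mix letters and digits, then generate variants by swapping characters within the confusable group and keep the first variant found in a list of known words. Everything here is hypothetical: `known_words` stands in for whatever vocabulary or name list is available, and the confusable group contains only the three characters mentioned above.

```python
from itertools import islice, product

CONFUSABLE = "l1I"  # the characters the OCR engine tends to swap


def is_suspect(token: str) -> bool:
    """Flag tokens that mix letters and digits, e.g. 'he1lo'."""
    has_alpha = any(ch.isalpha() for ch in token)
    has_digit = any(ch.isdigit() for ch in token)
    return has_alpha and has_digit


def candidates(token: str, max_variants: int = 256) -> list:
    """All spellings obtained by swapping confusable characters for one another."""
    slots = [CONFUSABLE if ch in CONFUSABLE else ch for ch in token]
    # The number of variants grows as 3**k, so cap how many are generated.
    return ["".join(combo) for combo in islice(product(*slots), max_variants)]


def correct(token: str, known_words: set) -> str:
    """Return the first confusable variant found in known_words, else the token."""
    if token in known_words:
        return token
    for cand in candidates(token):
        if cand in known_words:
            return cand
    return token


vocab = {"hello", "will", "India"}  # hypothetical word list
print([correct(t, vocab) for t in ["he1lo", "wi11", "1ndia", "Smith1"]])
```

Tokens with no match in the list are returned unchanged, which matters for names; `is_suspect` can then be used on its own to queue those for manual review.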

What are the best practices in this space? Are there free/open-source implementations of algorithms that do this sort of thing? Searching Google has yielded lots of papers, but little that is concrete. If there aren't implementations available, which of the many papers would be a good starting place?

Comments (2)

夜雨飘雪 2024-11-07 19:47:03

For "determining which text is actually text and which text isn't" you might want to look at rmgarbage, from the same department that developed Tesseract (ISRI). I've written a Perl implementation and there's also a Ruby implementation. For the 1 vs. l problem I'm experimenting with ocrspell (again from the same department), for which the original source is available.

I can only post two links, so the missing ones are:

  • ocrspell: enter "10.1007/PL00013558" at dx.doi.org
  • rmgarbage: search for "Automatic Removal of Garbage Strings in OCR Text: An Implementation"
  • ruby implementation: search for "docsplit textcleaner"
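
To give a feel for what a rule-based garbage filter looks like, here is a small sketch in the spirit of rmgarbage. It is not a port of the Perl or Ruby implementations: the four rules and their thresholds are simplified assumptions chosen for illustration.

```python
import re
import string

PUNCT = set(string.punctuation)


def is_garbage(token: str) -> bool:
    """Apply a few simple heuristics to decide whether a token is OCR noise."""
    # Rule 1: implausibly long strings.
    if len(token) > 40:
        return True
    # Rule 2: four or more identical characters in a row.
    if re.search(r"(.)\1{3,}", token):
        return True
    # Rule 3: more punctuation than letters and digits.
    alnum = sum(ch.isalnum() for ch in token)
    punct = sum(ch in PUNCT for ch in token)
    if punct > alnum:
        return True
    # Rule 4: ignoring the first and last character, two or more
    # distinct punctuation marks inside the token.
    inner = {ch for ch in token[1:-1] if ch in PUNCT}
    if len(inner) >= 2:
        return True
    return False


line = "Mr. O'Brien ~|:; met a.b,c Jean-Luc iiiii"
print(" ".join(t for t in line.split() if not is_garbage(t)))
```

None of the rules looks anything up in a dictionary, so names like "O'Brien" survive. The trade-off is that some legitimate strings, such as a DOI containing both "." and "/", trip rule 4 and would need an exception.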

笑饮青盏花 2024-11-07 19:47:03

Something that could be useful for you is to try this free online OCR and compare its results with yours to see if by playing with the image (e.g. scaling up/down) you could improve the results.

I was using it as an "upper bound" of the results I should get when using tesseract myself (after using OpenCV to modify the images).
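
The comparison can be automated roughly as follows: run OCR at several scale factors and keep the one whose output is closest to the reference text from the other OCR service. This is a sketch under assumptions: `ocr_at_scale` is a hypothetical callable supplied by the caller, and the similarity measure (the standard library's `difflib` ratio) is my choice, not something the workflow above specifies.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, Tuple


def similarity(a: str, b: str) -> float:
    """Similarity between two strings, from 0.0 to 1.0 (1.0 means identical)."""
    return SequenceMatcher(None, a, b).ratio()


def best_scale(
    ocr_at_scale: Callable[[float], str],
    reference: str,
    scales: Iterable[float],
) -> Tuple[float, float]:
    """Return (scale, score) for the scale whose OCR text best matches reference."""
    # ocr_at_scale is hypothetical: it should resize the image by the given
    # factor, run the OCR engine on it, and return the recognized text.
    scored = [(similarity(ocr_at_scale(s), reference), s) for s in scales]
    score, scale = max(scored)
    return scale, score


# Stand-in OCR results so the example runs without an image or OCR engine.
fake_results = {1.0: "He1lo w0rld", 2.0: "Hello world", 3.0: "Hel1o world"}
print(best_scale(lambda s: fake_results[s], "Hello world", [1.0, 2.0, 3.0]))
```

In a real run the callable might resize the scan with OpenCV's `cv2.resize` before handing it to Tesseract; that wrapper is left out here because it depends on the local setup.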
