Limit the characters tesseract is looking for

Posted 2024-08-23 15:39:51

Is it possible to limit the set of characters tesseract looks for (e.g. only the letters a-z)? That would greatly improve my results.

依靠 2024-08-30 15:39:51

Create a config file (e.g. "letters") in the tessdata/configs directory, usually /usr/share/tesseract/tessdata/configs
or
/usr/share/tesseract-ocr/tessdata/configs

Add this line to the config file:

tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz

A range such as [a-z] might also work; I have not tested it. Then call tesseract like this:

tesseract input.tif output nobatch letters

This limits tesseract to recognizing only the listed characters.
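
The two steps in this answer (write the config file, then pass its name on the command line) can be sketched in Python. This is only an illustration: the helper names write_whitelist_config and build_command are made up for this sketch, a temporary directory stands in for the real tessdata/configs directory, and the command is only printed, not executed.

```python
import string
import tempfile
from pathlib import Path


def write_whitelist_config(configs_dir, name, whitelist):
    """Write a one-line config file ("variable value") and return its path.

    Hypothetical helper: configs_dir is assumed to be your tessdata/configs.
    """
    path = Path(configs_dir) / name
    # Plain ASCII with LF line endings, one "variable value" pair per line.
    with open(path, "w", encoding="ascii", newline="\n") as f:
        f.write(f"tessedit_char_whitelist {whitelist}\n")
    return path


def build_command(image, output_base, config_name):
    """Return the argument list mirroring the command shown in the answer."""
    return ["tesseract", image, output_base, "nobatch", config_name]


if __name__ == "__main__":
    # A temporary directory stands in for the real configs directory.
    with tempfile.TemporaryDirectory() as tmp:
        cfg = write_whitelist_config(tmp, "letters", string.ascii_lowercase)
        print(cfg.read_text(), end="")
        print(" ".join(build_command("input.tif", "output", cfg.name)))
```

Building the command as a list rather than a single string keeps it safe to hand to subprocess.run later without shell quoting issues.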

少女的英雄梦 2024-08-30 15:39:51

To use a whitelist, either in a config file or via the -c tessedit_char_whitelist=... command-line switch, with version 4.0 you have to set the OCR engine mode to "Original Tesseract only" (--oem 0). This is because the new "Neural nets LSTM" mode does not respect the whitelist setting.
Example of a proper command line for version 4.0:

tesseract input_file output_file --oem 0 -c tessedit_char_whitelist=abc123

UPDATE: In version 4.0, the eng.traineddata file installed by default by the Windows installer and some Linux installers is corrupted. A temporary workaround is to replace tessdata\eng.traineddata with the file from an older version; it should be about 30 MB. Otherwise you will get the error "Tesseract couldn't load any languages!" or similar.

Update from tesseract 4.1.1

In tesseract 4.1.1 the whitelist problem described above is fixed, so the following works like a charm:
tesseract my_image.jpg stdout -l mylang configfile myconfig

where "myconfig" is a plain-text file located in TESSDATA/configs containing:

load_system_dawg false
load_freq_dawg false
tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789

夏末染殇 2024-08-30 15:39:51

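In addition to the config file, there is the -c flag:

tesseract stdin stdout -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyz --psm 6

(Tesseract 4 and later spell the page segmentation flag --psm; the single-dash -psm is the older form.)

Update: confirmed working on version 4.1.1.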
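
For anyone driving the command line from a script, the -c variant can be assembled as follows. This is a rough sketch, not an official wrapper: whitelist_args and run_ocr are hypothetical helper names, the --psm spelling assumes Tesseract 4 or later, and run_ocr assumes a tesseract binary is on the PATH.

```python
import shlex
import subprocess


def whitelist_args(image, output_base, whitelist, oem=None, psm=None):
    """Build a tesseract argument list with a -c whitelist setting."""
    args = ["tesseract", image, output_base]
    if oem is not None:
        args += ["--oem", str(oem)]  # e.g. 0 for the original engine
    if psm is not None:
        args += ["--psm", str(psm)]  # page segmentation mode
    args += ["-c", f"tessedit_char_whitelist={whitelist}"]
    return args


def run_ocr(image, whitelist, **kwargs):
    """Run tesseract on an image file and return the text written to stdout.

    Assumes the tesseract executable is installed and on the PATH.
    """
    result = subprocess.run(
        whitelist_args(image, "stdout", whitelist, **kwargs),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Print the command instead of running it, so no binary is needed here.
    print(shlex.join(whitelist_args("sample.jpg", "stdout", "abc123", oem=0)))
```

Passing oem=0 corresponds to the "Original Tesseract only" mode mentioned elsewhere in this thread; leaving it out keeps whatever engine mode the installation defaults to.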

时光与爱终年不遇 2024-08-30 15:39:51

Just adding this for anyone using tesseract on Android. In your readOCR function, where you set the language etc., add the following line:

tesseract.setVariable("tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZ");

You can also set a blacklist of characters to exclude.

喜你已久 2024-08-30 15:39:51

In Tesseract 4.00 this cannot be done. You can only fine-tune the model or use a regular expression to strip the extra characters from the prediction.

静赏你的温柔 2024-08-30 15:39:51

I am using Ubuntu 18.04.4 LTS. The default tesseract there is version 4, and I could not use the whitelist with it. After upgrading to version 5, the whitelist worked. Without a whitelist:

tesseract sample.jpg stdout -l eng --oem 3 --psm 7
Warning: Invalid resolution 0 dpi. Using 70 instead.
LL £036 GL)

With a whitelist:

tesseract sample.jpg stdout -l eng --oem 3 --psm 7 -c tessedit_char_whitelist="ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
Warning: Invalid resolution 0 dpi. Using 70 instead.
L4036GL

sample.jpg

坠似风落 2024-08-30 15:39:51

My answer is derived wholly from the accepted answer and is added here for the benefit of .NET Windows developers using the Tesseract NuGet package. However, take note of step 2, which applies to anybody using any kind of Tesseract on Windows.

1. Create a config folder inside your tessdata folder, where the other training data is located.
2. Add a letters file inside the config folder.
   Use an editor like TextPad that lets you save it in UNIX format with ANSI encoding (I initially tried UTF-8 / IBM PC, and tesseract threw an error into my test output).
3. Just like your training files, make sure the letters file has its Build Action set to Content in the Properties panel and is marked to copy to the output directory.
4. Invoke your tesseract engine class like this:

 var ocrEng = new TesseractEngine("./tessdata", "eng", EngineMode.Default, "letters");
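
The encoding note in this answer can be checked before deployment with a few lines of Python. This is a sketch under assumptions: config_problems is a hypothetical helper, and the three checks (byte order mark, carriage returns, non-ASCII bytes) are my reading of "UNIX format, ANSI encoding", not a documented list of what tesseract rejects.

```python
def config_problems(data: bytes):
    """Return a list of likely encoding problems in a config file's raw bytes."""
    problems = []
    if data.startswith(b"\xef\xbb\xbf"):
        problems.append("UTF-8 byte order mark")
    if b"\r" in data:
        problems.append("carriage returns (non-UNIX line endings)")
    if any(byte > 0x7F for byte in data):
        problems.append("non-ASCII bytes")
    return problems


if __name__ == "__main__":
    good = b"tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz\n"
    bad = b"\xef\xbb\xbftessedit_char_whitelist abc\r\n"
    print(config_problems(good))
    print(config_problems(bad))
```

To check a real file, pass it the result of open(path, "rb").read(), where path points at your letters file.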
