Training Tesseract 3 to recognize numbers from real images of gas meters

Posted on 2024-11-24 11:56:16

I'm trying to train Tesseract to recognize numbers in real images of gas meters.

The training images were taken with a camera, so they have many problems: poor resolution, blur, poor lighting, and low contrast caused by overexposure, reflections, shadows, and so on.

For training, I created one large image containing a series of digits cropped from the gas meter images, and I manually edited the box file to create the .tr files. As a result, only digits from the clearer, sharper images are recognized, while digits from blurred images are not detected by Tesseract.

Comments (4)

倾其所爱 2024-12-01 11:56:16

As far as I can tell, you need OpenCV to locate the box that contains the numbers, but OpenCV is not good at OCR. After you locate the box, crop that region, apply image processing, and then hand the result to Tesseract for OCR.

I need help with OpenCV because I don't know how to program with it.

Here are a few real-world examples.

  • The first image is the original (cropped power meter numbers)
  • The second image was slightly cleaned up in GIMP and gives around 50% OCR accuracy in Tesseract
  • The third image was completely cleaned up and gives 100% OCR accuracy without any training!

[Images: first image, second image, third image]
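
The locate-and-crop step described above can be sketched without OpenCV. This is only an illustration in plain Python, under strong assumptions: the image is a list of rows of grayscale values, the digits are darker than everything else, and the "box" is simply the bounding rectangle of all dark pixels. The names dark_bounding_box and crop and the cutoff of 128 are hypothetical, not part of any library; a real meter photo would likely need a more robust detector.

```python
def dark_bounding_box(gray, cutoff=128):
    """Return (top, left, bottom, right) around all pixels darker than cutoff.

    bottom and right are exclusive. Returns None if no pixel is dark.
    The cutoff value is an assumption chosen for this example.
    """
    rows = [y for y, row in enumerate(gray) if any(px < cutoff for px in row)]
    cols = [x for x in range(len(gray[0])) if any(row[x] < cutoff for row in gray)]
    if not rows:
        return None
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop(gray, box):
    """Cut the rectangle described by box out of the image."""
    top, left, bottom, right = box
    return [row[left:right] for row in gray[top:bottom]]


W = 255  # white background
image = [
    [W, W, W, W, W, W],
    [W, W, 20, W, W, W],
    [W, W, W, 30, W, W],
    [W, W, 25, W, W, W],
    [W, W, W, W, W, W],
]

box = dark_bounding_box(image)
print(box)               # (1, 2, 4, 4)
print(crop(image, box))  # [[20, 255], [255, 30], [25, 255]]
```

The cropped region is what would then be cleaned up and passed to Tesseract.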

你怎么这么可爱啊 2024-12-01 11:56:16

I would try this simple ImageMagick command first:

 convert          \
    original.jpg  \
   -threshold 50% \
    result.jpg

(Experiment with the 50% parameter: try both lower and higher values.)

Thresholding basically leaves only two values, zero or the maximum, for each color channel. Values below the threshold are set to 0, and values above it are set to 255 (or 65535 when working at 16-bit depth).

Depending on your original.jpg, the result may be a very high-contrast image that OCR can handle.
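
To make the thresholding rule concrete, here is a minimal plain-Python sketch of the same idea for a single grayscale channel. It is not ImageMagick's implementation; the function name threshold and the list-of-rows image format are assumptions made for this example.

```python
def threshold(gray, percent=50, max_value=255):
    """Binarize a grayscale image given as a list of rows of ints.

    Pixels strictly above the cutoff become max_value, all others become 0.
    """
    cutoff = max_value * percent / 100.0
    return [[max_value if px > cutoff else 0 for px in row] for row in gray]


image = [
    [30, 40, 200, 210],
    [35, 220, 215, 45],
    [128, 127, 50, 230],
]

for row in threshold(image, percent=50):
    print(row)
# [0, 0, 255, 255]
# [0, 255, 255, 0]
# [255, 0, 0, 255]
```

With percent=50 the cutoff is 127.5, so 128 becomes white and 127 becomes black. Raising or lowering percent moves that cutoff, which is what experimenting with the 50% parameter does.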

只想待在家 2024-12-01 11:56:16

I suggest that you:

  • use a tool to edit the boxes, such as jTessBoxEditor; it is very helpful and saves time, and it is easy to install
  • train on characters as they actually appear (noisy, blurred). Your training set is still limited, so add more training samples.
  • use Tesseract's own API to enhance the image (denoise, normalize, sharpen...)
    for example: Boxa* tesseract::TessBaseAPI::GetConnectedComponents(Pixa** pixa) (it gives you access to the bounding box of each character)

    Pix* pimg = tess_api->GetThresholdedImage();

Here you can find a few examples.

一瞬间的火花 2024-12-01 11:56:16

Tesseract is a pretty decent OCR package, but it does not pre-process images well. In my experience, you can get good OCR results if you do some pre-processing before passing the image to Tesseract.

A few key pointers improve recognition significantly:

  1. Remove background noise. Basically, this means using mean adaptive thresholding. I'd also ensure that the characters are black and the background is white.
  2. Use the correct resolution. If you get bad results, scale the image up or down until the results improve. Aim for roughly font size 14 at 300 dpi; that works best in my software, which processes invoices.
  3. Don't store images as JPEG; use BMP, PNG, or another format that doesn't add noise to the image.
  4. If you're only using one or two fonts, try training Tesseract on those fonts.
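
Point 1 can be illustrated with a small plain-Python sketch of mean adaptive thresholding: each pixel is compared with the average of its own neighbourhood instead of one global cutoff, which helps with shadows and uneven lighting. The function name, the radius and offset defaults, and the list-of-rows image format are assumptions for this example; in practice a library routine such as OpenCV's cv2.adaptiveThreshold with ADAPTIVE_THRESH_MEAN_C does this job.

```python
def adaptive_mean_threshold(gray, radius=1, offset=10):
    """Binarize by comparing each pixel with the mean of its neighbourhood.

    gray   -- list of rows of ints (0 = black, 255 = white)
    radius -- the window is a (2 * radius + 1) square, clipped at the borders
    offset -- constant subtracted from the local mean before comparing
    Dark strokes come out as 0 (black), everything else as 255 (white).
    """
    height, width = len(gray), len(gray[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(window) / len(window)
            row.append(0 if gray[y][x] < local_mean - offset else 255)
        out.append(row)
    return out


# Background brightens from left to right (a shadow on the left side);
# two darker vertical strokes sit at columns 2 and 5.
line = [100, 115, 60, 145, 160, 105, 190, 205]
image = [line[:] for _ in range(3)]

for row in adaptive_mean_threshold(image):
    print(row)
# [255, 255, 0, 255, 255, 0, 255, 255]  (printed three times)
```

A single global cutoff at 127.5 would also turn the shaded background pixels 100 and 115 black, merging them with the left stroke; the local comparison keeps both strokes separate from the background.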

As for point 4, if you know which font will be used, there are better solutions than Tesseract, such as matching the font's glyphs directly against the image. The basic algorithm is to find the digits and match each one against all possible characters (there are only 10). Still, the implementation is tricky.
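
The matching idea can be sketched as follows, assuming each digit has already been located, binarized, and scaled to a fixed grid. The 3x5 bitmaps below are made-up templates for illustration only (real ones would be cut from actual meter images), and the pixel-difference count is just one simple similarity measure, not necessarily the one a production system would use.

```python
# Hypothetical 3x5 templates: "1" marks a stroke pixel, "0" the background.
TEMPLATES = {
    "0": ["111", "101", "101", "101", "111"],
    "1": ["010", "110", "010", "010", "111"],
    "2": ["111", "001", "111", "100", "111"],
    "3": ["111", "001", "111", "001", "111"],
    "4": ["101", "101", "111", "001", "001"],
    "5": ["111", "100", "111", "001", "111"],
    "6": ["111", "100", "111", "101", "111"],
    "7": ["111", "001", "001", "001", "001"],
    "8": ["111", "101", "111", "101", "111"],
    "9": ["111", "101", "111", "001", "111"],
}


def distance(glyph, template):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(
        g != t
        for g_row, t_row in zip(glyph, template)
        for g, t in zip(g_row, t_row)
    )


def classify(glyph):
    """Return (digit, distance) for the template closest to the glyph."""
    scores = ((digit, distance(glyph, tpl)) for digit, tpl in TEMPLATES.items())
    return min(scores, key=lambda pair: pair[1])


# A "7" with one noisy pixel in the bottom-left corner.
noisy_seven = ["111", "001", "001", "001", "101"]
print(classify(noisy_seven))  # ('7', 1)
```

The tricky parts mentioned above sit outside this sketch: finding and segmenting the digits, normalizing their size, and coping with blur and partial digits on rolling counters.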
