How to set up Tesseract OCR correctly

Posted on 2025-01-14 23:14:03

I am using Tesseract OCR to convert a preprocessed license plate image into text, but I have had little success with some images that look perfectly fine. The Tesseract setup can be seen in the function definition below. I am running this on Google Colab. The input image is the ZG NIVEA 1 plate. I am not sure whether I am using something incorrectly or whether there is a better way to do this; the result I get from this particular image is A.

!sudo apt install -q tesseract-ocr
!pip install -q pytesseract
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'/usr/bin/tesseract'
import cv2
import re

def pytesseract_image_to_string(img, oem=3, psm=7) -> str:
  '''
  oem - OCR Engine Mode
      0 = Original Tesseract only.
      1 = Neural nets LSTM only.
      2 = Tesseract + LSTM.
      3 = Default, based on what is available.
  psm - Page Segmentation Mode
      0 = Orientation and script detection (OSD) only.
      1 = Automatic page segmentation with OSD.
      2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
      3 = Fully automatic page segmentation, but no OSD. (Default)
      4 = Assume a single column of text of variable sizes.
      5 = Assume a single uniform block of vertically aligned text.
      6 = Assume a single uniform block of text.
      7 = Treat the image as a single text line.
      8 = Treat the image as a single word.
      9 = Treat the image as a single word in a circle.
      10 = Treat the image as a single character.
      11 = Sparse text. Find as much text as possible in no particular order.
      12 = Sparse text with OSD.
      13 = Raw line. Treat the image as a single text line,
          bypassing hacks that are Tesseract-specific.
  '''
  tess_string = pytesseract.image_to_string(img, config=f'--oem {oem} --psm {psm}')
  regex_result = re.findall(r'[A-Z0-9]', tess_string) # filter only uppercase alphanumeric symbols
  return ''.join(regex_result)

image = cv2.imread('nivea.png')
print(pytesseract_image_to_string(image))

Edit: The approach in the accepted answer works for the ZGNIVEA1 image, but not for others. Is there a general "font size" that Tesseract OCR works best with, or a rule of thumb?

眼藏柔 2025-01-21 23:14:03

By applying a Gaussian blur before OCR, I got the correct output. Also, you may not need the regex if you add -c tessedit_char_whitelist=ABC.. to your config string.

The code that produces correct output for me:

import cv2
import pytesseract

image = cv2.imread("images/tesseract.png")

# Single text line (--psm 7); restrict the output to uppercase letters only
config = '--oem 3 --psm 7 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ'

# Upscale 2x, then blur to smooth the character edges before OCR
image = cv2.resize(image, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
image = cv2.GaussianBlur(image, (5, 5), 0)

string = pytesseract.image_to_string(image, config=config)

print(string)

Output:

Answer 2:

Sorry for the late reply. I tested the same code on your second image and it gave me the correct output. Are you sure you removed the whitelist part of the config? My whitelist does not allow digits.
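For plates that mix letters and digits, the whitelist has to include both. Here is a minimal sketch of a helper that builds the config string; the function name build_tesseract_config and its defaults are made up for illustration and are not part of pytesseract.

```python
import string


def build_tesseract_config(
    oem: int = 3,
    psm: int = 7,
    whitelist: str = string.ascii_uppercase + string.digits,
) -> str:
    """Build a config string for pytesseract.image_to_string.

    Hypothetical helper: by default it whitelists A-Z and 0-9, so digits
    on a plate are not silently dropped.
    """
    # The whitelist is passed as a single token, so whitespace would break it.
    if not whitelist or any(ch.isspace() for ch in whitelist):
        raise ValueError("whitelist must be non-empty and contain no whitespace")
    return f"--oem {oem} --psm {psm} -c tessedit_char_whitelist={whitelist}"


print(build_tesseract_config())
```

The result can be passed as config=build_tesseract_config() in the image_to_string call above; pass whitelist=string.ascii_uppercase to get the letters-only behaviour back.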

The most accurate solution here is to train your own Tesseract model on the license plate font (FE-Schrift) instead of using Tesseract's default eng.traineddata model. It will definitely increase accuracy, since the model's output classes will contain only the characters relevant to your case. As for your latter question, Tesseract does some preprocessing before recognition (thresholding, morphological closing, etc.), which is why it is so sensitive to letter size. (In a smaller image the contours are closer to each other, so closing does not keep them separate.)
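To see why letter size matters for a closing step, here is a toy one-dimensional illustration. It does not reproduce Tesseract's internal preprocessing; the functions and the kernel width of 3 are assumptions chosen only to show the effect. A row of pixels holds two strokes (1s) separated by a gap (0s): with a 1-pixel gap the closing merges the strokes, while the same strokes at three times the size keep their gap.

```python
from typing import List


def _window(row: List[int], i: int, radius: int) -> List[int]:
    # In-bounds neighbours of position i (borders are simply truncated).
    return row[max(0, i - radius): i + radius + 1]


def dilate_1d(row: List[int], k: int = 3) -> List[int]:
    # A pixel becomes 1 if any pixel in its window is 1.
    return [max(_window(row, i, k // 2)) for i in range(len(row))]


def erode_1d(row: List[int], k: int = 3) -> List[int]:
    # A pixel stays 1 only if every pixel in its window is 1.
    return [min(_window(row, i, k // 2)) for i in range(len(row))]


def close_1d(row: List[int], k: int = 3) -> List[int]:
    # Morphological closing = dilation followed by erosion.
    return erode_1d(dilate_1d(row, k), k)


small = [1, 1, 0, 1, 1]        # two strokes, 1-pixel gap
large = [1, 1, 0, 0, 0, 1, 1]  # same strokes, gap scaled to 3 pixels

print(close_1d(small))  # [1, 1, 1, 1, 1] -> strokes merged
print(close_1d(large))  # [1, 1, 0, 0, 0, 1, 1] -> gap preserved
```

This is consistent with the 2x cv2.resize step in the code above helping: enlarging the image widens the gaps between neighbouring contours before any closing is applied.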

To train Tesseract on a custom font, you can follow the official docs.

To read more about the theory behind Tesseract, you can check these papers:
1 (relatively old)
2 (newer)
