Pytesseract: whitelisting a set of strings
I am trying to detect text from a report saved as an image. The report always has the same structure.
For example, see the picture below (an example found via Google).
The actual report contains mostly internal abbreviations from the aviation industry which are not recognized correctly by Pytesseract.
The program must recognize only CC, C1, ..., the parameter names (a list of strings), and numbers. So basically I am looking for a way to whitelist a couple of strings plus all numbers. Is it possible to specify this in Pytesseract?
So far I have only found tessedit_char_whitelist, which unfortunately does not help me because it whitelists individual characters rather than whole strings.
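To illustrate why tessedit_char_whitelist falls short here, the following is a minimal sketch that derives a character whitelist from a list of allowed strings. The helper name `build_whitelist_config` and the example names are hypothetical, and the `--oem 3 --psm 4` defaults are simply taken from the attempt shown in this post. The result restricts which characters may appear, but it cannot force the output to be one of the listed strings.

```python
import string


def build_whitelist_config(allowed_names, base="--oem 3 --psm 4"):
    """Build a Tesseract config string that whitelists characters.

    The whitelist is the set of all digits plus every non-whitespace
    character that occurs in the allowed names. Whitespace is skipped
    because it would split the config string into separate arguments.
    """
    chars = set(string.digits)
    for name in allowed_names:
        chars.update(ch for ch in name if not ch.isspace())
    whitelist = "".join(sorted(chars))
    return f"{base} -c tessedit_char_whitelist={whitelist}"


# Hypothetical usage with the names from this post:
# config = build_whitelist_config(["CC", "C1"])
# text = pt.image_to_string(img, config=config)
```

With names such as CC and C1 the whitelist still permits any sequence made of C and the digits, so a misread like "C0C" would pass.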
What I have done so far:
import cv2
import pytesseract as pt
import pandas as pd
import numpy as np

filename = 'Rep15_4.jpg'
img = cv2.imread(filename)

# Configurations that gave the best results so far:
# config = r'--oem 1 --psm 4'
# or
# config = r'--oem 1 --psm 6'
# or
# config = r'--oem 3 --psm 4'
config = r'--oem 3 --psm 4'

# Run OCR on the whole image and return the recognized text as one string
text = pt.image_to_string(img, config=config)
For config I have tried all options for oem and psm. On top of that, I tried to preprocess the image with cv2 (cvtColor, GaussianBlur, threshold). The 0 values are the most problematic.
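Since the set of valid tokens is known in advance, the restriction could also be applied to the OCR output afterwards rather than inside Tesseract. The following is only a sketch under assumptions: the names in the usage comment, the look-alike mapping (O to 0, l and I to 1), and the similarity cutoff of 0.8 are all hypothetical and would need tuning against real reports. It uses only the standard library (`difflib`, `re`).

```python
import difflib
import re

# Plain integers or decimals, optionally negative
NUMBER_RE = re.compile(r"^-?\d+(\.\d+)?$")

# Assumed mapping of letters that OCR may confuse with digits
DIGIT_LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})


def normalize_token(token, allowed, cutoff=0.8):
    """Map one OCR token to an allowed name or a number, else None."""
    if token in allowed:
        return token
    # Retry after replacing digit look-alikes (e.g. "1O.5" -> "10.5")
    candidate = token.translate(DIGIT_LOOKALIKES)
    if candidate in allowed or NUMBER_RE.match(candidate):
        return candidate
    # Fall back to the closest allowed name, if it is similar enough
    close = difflib.get_close_matches(token, allowed, n=1, cutoff=cutoff)
    return close[0] if close else None


def filter_text(text, allowed):
    """Keep only tokens that resolve to an allowed name or a number."""
    resolved = (normalize_token(tok, allowed) for tok in text.split())
    return [tok for tok in resolved if tok is not None]


# Hypothetical usage on the OCR result from above:
# tokens = filter_text(text, ["CC", "C1", "ALTITUDE"])
```

The trade-off is that a low cutoff may map unrelated words onto a parameter name, while a high cutoff discards tokens with more than one misread character.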
Thanks for your help.