How to prevent Google Vision from extracting irrelevant languages

Posted on 2025-02-12 15:45:59

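I am using Google Vision from Python to extract text from an ID card.

I crop the picture and send it to the Google API.

Since the ID card contains only English, Chinese, numbers and some common symbols (e.g. - * ( )), I set languageHints to ["en", "zh-Hant-HK"].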

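# encoded_string holds the base64-encoded bytes of the cropped image
data = {
    'features':{
        'type': 'TEXT_DETECTION'
    },
    'image':{
        'content': encoded_string.decode("utf-8")
    },
    'imageContext':{
        # restrict OCR to English and Traditional Chinese (Hong Kong)
        'languageHints': ["en","zh-Hant-HK"]
    }
}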
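
For reference, here is a minimal sketch of how a payload like this could be built end to end. It assumes the request goes to the REST images:annotate endpoint, whose documented body wraps each image request in a "requests" list and takes "features" as a list. The helper name build_annotate_request and the placeholder image bytes are hypothetical and only for illustration.

```python
import base64
import json


def build_annotate_request(image_bytes, language_hints):
    """Build a JSON-serializable body for a TEXT_DETECTION request.

    Assumes the REST images:annotate body shape; adjust if a client
    library is used instead.
    """
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "requests": [
            {
                "image": {"content": encoded},
                "features": [{"type": "TEXT_DETECTION"}],
                "imageContext": {"languageHints": list(language_hints)},
            }
        ]
    }


# Placeholder bytes stand in for the cropped ID card image.
payload = build_annotate_request(b"fake image bytes", ["en", "zh-Hant-HK"])
body = json.dumps(payload)  # string that would be sent as the POST body
```

Building the payload in one function keeps the language hints next to the image they apply to, which makes it easier to try different hint lists.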

However, the Google API returns text in some irrelevant languages, like this:

"CHAN, Mang H\u1ed3\nHo"

Sometimes it returns Unicode characters that are neither English nor Chinese.

In this case, \u1ed3 is ồ (LATIN SMALL LETTER O WITH CIRCUMFLEX AND GRAVE).
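
A stray character like this can be identified with Python's standard unicodedata module. The sketch below flags anything outside the expected sets and, as a post-processing workaround only, folds accented Latin letters back to their base letter. It does not change what the API returns. The helper names are hypothetical, and the "expected" range is an assumption: ASCII plus the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which may need widening for rarer name characters or full-width punctuation.

```python
import unicodedata


def is_expected(ch):
    """True for ASCII or the basic CJK Unified Ideographs block (assumed set)."""
    cp = ord(ch)
    return cp < 0x80 or 0x4E00 <= cp <= 0x9FFF


def unexpected_chars(text):
    """List (character, Unicode name) pairs for characters outside the expected set."""
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in text if not is_expected(ch)]


def fold_latin_diacritics(text):
    """Replace unexpected accented Latin letters with their unaccented base letter."""
    out = []
    for ch in text:
        if not is_expected(ch) and unicodedata.name(ch, "").startswith("LATIN"):
            # NFD splits the letter into base + combining marks; drop the marks.
            decomposed = unicodedata.normalize("NFD", ch)
            ch = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
        out.append(ch)
    return "".join(out)


ocr_text = "CHAN, Mang H\u1ed3\nHo"
flagged = unexpected_chars(ocr_text)
cleaned = fold_latin_diacritics(ocr_text)
```

Flagging before folding keeps a record of what the OCR actually produced, so suspicious results can still be reviewed by hand.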

How can I prevent cases like this?

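//update

I have tried some image processing (such as thresholding and increasing contrast) to remove the background and make the text sharper before sending it to the Google API.

Processed with grayscale and threshold

Processed with grayscale and increased contrast

It still reads the o as the Latin letter (ồ).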
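
The preprocessing steps above can be sketched in plain Python to show what each one does to pixel values. This is only an illustration under stated assumptions: the post does not say which library was used (in practice an imaging library would do this work), and the function names, the BT.601 luma weights and the cutoff of 128 are choices made for the example.

```python
def to_gray(rgb_rows):
    """Convert rows of (r, g, b) tuples to 0-255 luma values (BT.601 weights)."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_rows
    ]


def threshold(gray_rows, cutoff=128):
    """Binarize: pixels at or above the cutoff become white, the rest black."""
    return [[255 if v >= cutoff else 0 for v in row] for row in gray_rows]


def stretch_contrast(gray_rows):
    """Linearly rescale so the darkest pixel maps to 0 and the brightest to 255."""
    flat = [v for row in gray_rows for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in gray_rows]
    return [[round((v - lo) * 255 / (hi - lo)) for v in row] for row in gray_rows]


# Tiny 2x2 stand-in for a cropped card image: dark text on a light background.
pixels = [[(30, 30, 30), (200, 200, 200)], [(90, 90, 90), (250, 250, 250)]]
gray = to_gray(pixels)
binary = threshold(gray)
stretched = stretch_contrast(gray)
```

Thresholding discards all intermediate tones, while contrast stretching keeps them, which is why the two variants can produce different OCR results on the same card.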
