Replacing Unicode characters / Python / Django

Posted on 2025-02-05 18:21:16

Since I'm pretty much forced to replace some Unicode characters in a string returned by an OCR service, the only way I have found to do it is to replace them one by one, using the following code:

def recode(mystr):
    mystr = mystr.replace(r'\u0104', '\u0104')
    mystr = mystr.replace(r'\u017c', '\u017c')
    mystr = mystr.replace(r'\u0106' , '\u0106')
    ...
    ...
    mystr = mystr.replace(r'\u017a' , '\u017a')
    mystr = mystr.replace(r'\u017c' , '\u017c')
    return mystr

I know this might be confusing. The string returned by the OCR API contains literal escape sequences: for example, "\u017a" is not the single Unicode character but the six characters "\", "u", "0", "1", "7", "a". This can't be changed on my end.

The above solution is very messy and unprofessional. However, if I try to loop through all the characters that I want to "replace", it seems to do nothing:

def recode(mystr):
    for foo in ['\u0106','\u0118','\u0141', ...... , '\u017a','\u017c']:
        mystr = mystr.replace(r'%s' % foo, foo)
    return mystr

Why is the foo string not treated as raw text in this case, when it works properly in the first scenario? What is the difference?

Comments (2)

雨轻弹 2025-02-12 18:21:16

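The reason foo is not read as raw text is that the r prefix only matters when a string literal is parsed in the source code. Afterwards the result is an ordinary string, for example when the %-operator is applied. In the loop, '\u0106' has already become a single character by the time it reaches r'%s' % foo, so replace() swaps that character for itself.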
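A small check can make the difference visible. This is only an illustrative sketch; the names raw, cooked and ocr_text are made up for the example, and ocr_text stands in for whatever the OCR API actually returns.

```python
# The r prefix only changes how the literal is parsed in the source code.
raw = r'\u0106'     # six characters: backslash, 'u', '0', '1', '0', '6'
cooked = '\u0106'   # one character: the letter C with acute

print(len(raw), len(cooked))

# Formatting an existing string does not re-escape it, so this is a no-op.
same = (r'%s' % cooked) == cooked
print(same)

# Hypothetical OCR output containing the literal escape sequence.
ocr_text = r'\u0106'

# The loop from the question searches for the real character, which is
# not present in the OCR text, so nothing is replaced.
unchanged = ocr_text.replace(r'%s' % cooked, cooked) == ocr_text
print(unchanged)
```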

As a solution to what you want to do, you can try something like this:

bar = r"\u0104"
mystr = mystr.replace(bar, chr(int(bar[2:], 16)))
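The same idea can be put into the loop from the question, provided the list holds raw strings. The sketch below assumes a short, hypothetical list named ESCAPES; the real list would contain whichever escapes the OCR API emits.

```python
# Hypothetical subset of the escapes to handle; note the r prefix on each item.
ESCAPES = [r'\u0104', r'\u0106', r'\u017a', r'\u017c']


def recode(mystr, escapes=ESCAPES):
    # Replace each literal six-character escape with the character it names.
    for esc in escapes:
        # esc[2:] is the four hex digits after the backslash and 'u'.
        mystr = mystr.replace(esc, chr(int(esc[2:], 16)))
    return mystr


result = recode(r'\u0104\u017c')
print(result)
```

Escapes that are not in the list are left untouched, which may or may not be what you want.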

深海里的那抹蓝 2025-02-12 18:21:16

This is an X-Y problem. The API is returning literal Unicode escape sequences. Maybe the response is actually JSON and the OP should be calling json.loads() on the returned data, but if not, you can use the unicode_escape codec to translate the escape codes. That codec requires a byte string, so the text may need to be encoded via ascii or latin1 first:

def recode(mystr):
    mystr = mystr.replace(r'\u0104', '\u0104')
    mystr = mystr.replace(r'\u017c', '\u017c')
    mystr = mystr.replace(r'\u0106' , '\u0106')
    mystr = mystr.replace(r'\u017a' , '\u017a')
    mystr = mystr.replace(r'\u017c' , '\u017c')
    return mystr

def recode2(s):
    return s.encode('latin1').decode('unicode_escape')

s = r'\u0104\u017c\u0106\u017a\u017c'
print(s)
print(recode(s))
print(recode2(s))

Output:

\u0104\u017c\u0106\u017a\u017c
ĄżĆźż
ĄżĆźż
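Two possible variations, sketched under assumptions about the data. The first wraps the text in quotes and parses it as a JSON string, which only works if the text is valid JSON string content (no unescaped double quotes or raw control characters). The second uses a regular expression and may be useful when the text mixes escapes with real non-Latin-1 characters, since encode('latin1') raises UnicodeEncodeError on those. The function names here are made up for illustration, and the regex version handles only four-digit \uXXXX escapes without combining surrogate pairs.

```python
import json
import re


def decode_json_string(s):
    # Assumes s is the body of a JSON string literal without its quotes.
    return json.loads('"' + s + '"')


# Matches a backslash, 'u', and exactly four hex digits.
_U_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')


def decode_u_escapes(s):
    # Replace each \uXXXX escape with its character; leave everything else alone.
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)


s = r'\u0104\u017c\u0106'
print(decode_json_string(s))
print(decode_u_escapes(s))
```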