Replacing Unicode characters / Python / Django
I'm pretty much forced to replace some Unicode characters in a string returned by an OCR service, and the only way I have found to do it is to replace them one by one. This is done using the following code:
    def recode(mystr):
        mystr = mystr.replace(r'\u0104', '\u0104')
        mystr = mystr.replace(r'\u017c', '\u017c')
        mystr = mystr.replace(r'\u0106', '\u0106')
        ...
        ...
        mystr = mystr.replace(r'\u017a', '\u017a')
        mystr = mystr.replace(r'\u017c', '\u017c')
        return mystr
I know this might be confusing. The string returned by the OCR API contains literal character sequences: for example, "\u017a" is not the single mapped Unicode character but rather the six characters "\", "u", "0", "1", "7", "a". This can't be changed from my end.
The above solution is very messy and unprofessional. However, if I try to loop through all the characters that I want to "replace", it seems like it doesn't do anything:
    def recode(mystr):
        for foo in ['\u0106','\u0118','\u0141', ...... , '\u017a','\u017c']:
            mystr = mystr.replace(r'%s' % foo, foo)
        return mystr
Why, in this case, is the foo string not read as raw text, when in the first scenario it is handled properly? What is the difference?
Comments (2)
So the reason why foo is not read as raw text is that the r in front of a string literal only plays a role when the string is created. Afterwards it acts as a normal string, for example when the % operator is applied: r'%s' % foo simply produces the same single character that foo already holds. As a solution to what you want to do, you can try something like this:
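A minimal sketch of one such approach, assuming Python 3 and that the OCR output uses lowercase hex digits. The list of characters here is a hypothetical subset, and the six-character escape text is rebuilt from each character's code point rather than from a raw literal:

```python
def recode(mystr):
    # Hypothetical subset of the characters to restore; extend as needed.
    chars = ['\u0104', '\u0106', '\u0118', '\u0141', '\u017a', '\u017c']
    for ch in chars:
        # Build the literal text (backslash, 'u', four hex digits)
        # from the character's code point.
        escaped = '\\u%04x' % ord(ch)
        mystr = mystr.replace(escaped, ch)
    return mystr


# The raw literal below holds backslash-u sequences, not real characters.
print(recode(r'\u017a and \u017c'))
```

If the service may also emit uppercase hex digits, the same loop would need a second replacement using '%04X'.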
This is an X-Y problem. The API is returning literal Unicode escape sequences. Maybe the response is actually JSON and the OP should be calling json.loads() on the returned data, but if not, you can use the unicode_escape codec to translate the escape codes. That codec requires a byte string, so the text may need to be encoded via ascii or latin1 first.
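A minimal sketch of both options, assuming Python 3. The sample string raw is a hypothetical stand-in for the OCR output, and the JSON variant assumes the text contains no unescaped quotes or other characters that would be invalid inside a JSON string:

```python
import json

# Hypothetical OCR output: literal backslash-u sequences, not real characters.
raw = r'za\u017c\u00f3\u0142\u0107'

# unicode_escape decodes bytes, so encode first. ASCII is enough when the
# input is pure ASCII; latin1 also accepts characters up to U+00FF.
decoded = raw.encode('ascii').decode('unicode_escape')
print(decoded)

# If the payload is really JSON, the JSON parser handles the escapes itself.
# Wrapping in quotes here only simulates a JSON string value.
from_json = json.loads('"' + raw + '"')
print(from_json)
```

Both calls should yield the same text for this input. Note that unicode_escape also interprets other backslash sequences such as \n, so it is only safe when every backslash in the data is meant as an escape.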