WSGI - processing unicode characters from a POST
Python 2.7
raw = '%C3%BE%C3%A6%C3%B0%C3%B6'  # string from WSGI post_data
raw_uni = raw.replace('%', r'\x')
raw_uni  # gives '\\xC3\\xBE\\xC3\\xA6\\xC3\\xB0\\xC3\\xB6'
print raw_uni  # gives \xC3\xBE\xC3\xA6\xC3\xB0\xC3\xB6
uni = unicode(raw_uni, 'utf-8')
uni  # gives u'\\xC3\\xBE\\xC3\\xA6\\xC3\\xB0\\xC3\\xB6'
print uni  # gives \xC3\xBE\xC3\xA6\xC3\xB0\xC3\xB6
However, if I change raw_uni to:
raw_uni = '\xC3\xBE\xC3\xA6\xC3\xB0\xC3\xB6'
and now do:
uni = unicode(raw_uni, 'utf-8')
uni #gives u'\xfe\xe6\xf0\xf6'
print uni #gives þæðö
which is what I want.
How do I get rid of the extra '\' in raw_uni, or take advantage of the fact that it's only there in the repr version of the string? More to the point, why does unicode(raw_uni, 'utf-8') use the repr version of the string?
Thanks
Comments (1)
You should be using urllib.unquote, not a manual replace:
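A minimal sketch of that suggestion, assuming the input is the percent-encoded ASCII string from the question. The try/except import is an addition for illustration only, so that the same snippet also runs on Python 3, where the byte-returning equivalent of urllib.unquote is urllib.parse.unquote_to_bytes.

```python
try:
    # Python 2.7: unquote takes a byte string and returns a byte string
    from urllib import unquote as unquote_to_bytes
except ImportError:
    # Python 3: the byte-returning variant lives in urllib.parse
    from urllib.parse import unquote_to_bytes

raw = '%C3%BE%C3%A6%C3%B0%C3%B6'  # string from WSGI post_data

# Each %XX triple becomes one real byte
decoded_bytes = unquote_to_bytes(raw)

# The bytes are UTF-8, so decoding yields the four characters the question wants
uni = decoded_bytes.decode('utf-8')
print(len(uni))
```

Here uni compares equal to u'\xfe\xe6\xf0\xf6'. If the value comes from an HTML form body, spaces arrive as '+'; on Python 2.7, urllib.unquote_plus converts those as well.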
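The underlying issue here is that you have a fundamental misunderstanding of what hex escapes are. The repr of a non-printable character can be expressed as a hex escape, which looks like a single backslash, followed by an 'x', followed by two hex characters. This is also how you would type these characters into a string literal, but it is still only a single character. Your replace line does not turn your original string into hex escapes; it just replaces each '%' with a literal backslash character followed by an 'x'. Consider the following examples: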
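The difference can be sketched with a few length checks. The variable names here are made up for illustration. The point is that unicode(raw_uni, 'utf-8') is not using the repr at all: after the replace, the string really does contain backslash characters, and since they are plain ASCII, decoding leaves them unchanged.

```python
escaped = '\x41'   # hex escape in a literal: Python stores the single character 'A'
literal = r'\x41'  # raw string: four separate characters, \ x 4 1

print(len(escaped))    # 1
print(len(literal))    # 4
print(escaped == 'A')  # True

raw = '%C3%BE'
replaced = raw.replace('%', r'\x')

# The result holds real backslashes; repr shows each one doubled
print(repr(replaced))  # '\\xC3\\xBE'
print(len(replaced))   # 8: each 3-character %XX became 4 characters
```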
If for some reason you can't use urllib.unquote, the following should work:
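One possible fallback, sketched under the assumption that the input is ASCII text with %XX escapes encoding UTF-8 bytes. On Python 2.7 a shorter route is raw.replace('%', r'\x').decode('string_escape'), which asks Python to interpret the backslash sequences as escapes, but that codec exists only on Python 2. The helper below, percent_decode, is a hypothetical name and not part of any library; it converts each escape to a byte by hand and runs on both Python 2.7 and Python 3.

```python
import re

# Matches one percent escape: '%' followed by exactly two hex digits
_HEX_ESCAPE = re.compile(r'%([0-9A-Fa-f]{2})')

def percent_decode(raw):
    """Turn %XX escapes into bytes, then decode the result as UTF-8."""
    out = bytearray()
    pos = 0
    for match in _HEX_ESCAPE.finditer(raw):
        # Copy any plain ASCII text that precedes this escape
        out += raw[pos:match.start()].encode('ascii')
        # Convert the two hex digits into a single byte value
        out.append(int(match.group(1), 16))
        pos = match.end()
    # Copy whatever follows the last escape
    out += raw[pos:].encode('ascii')
    return out.decode('utf-8')

uni = percent_decode('%C3%BE%C3%A6%C3%B0%C3%B6')
print(len(uni))
```

Unlike urllib.unquote_plus, this sketch leaves '+' untouched, so form-encoded spaces would still need separate handling.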