wsgi - processing unicode characters from a post

Posted on 2024-12-06 17:29:58

Python 2.7

raw = '%C3%BE%C3%A6%C3%B0%C3%B6' # string from wsgi post_data
raw_uni = raw.replace('%', r'\x')
raw_uni # gives '\\xC3\\xBE\\xC3\\xA6\\xC3\\xB0\\xC3\\xB6'
print raw_uni # gives \xC3\xBE\xC3\xA6\xC3\xB0\xC3\xB6
uni = unicode(raw_uni, 'utf-8')
uni # gives u'\\xC3\\xBE\\xC3\\xA6\\xC3\\xB0\\xC3\\xB6'
print uni # gives \xC3\xBE\xC3\xA6\xC3\xB0\xC3\xB6

However, if I change raw_uni to be:

raw_uni = '\xC3\xBE\xC3\xA6\xC3\xB0\xC3\xB6'

and now do:

uni = unicode(raw_uni, 'utf-8')
uni #gives u'\xfe\xe6\xf0\xf6'
print uni #gives þæðö

which is what I want.

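How do I get rid of the extra '\' in raw_uni, or take advantage of the fact that it only appears in the repr of the string? More to the point, why does unicode(raw_uni, 'utf-8') use the repr version of the string?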

Thanks

月竹挽风 2024-12-13 17:29:58

You should be using urllib.unquote, not a manual replace:

>>> import urllib
>>> raw = '%C3%BE%C3%A6%C3%B0%C3%B6'
>>> urllib.unquote(raw)
'\xc3\xbe\xc3\xa6\xc3\xb0\xc3\xb6'
>>> unicode(urllib.unquote(raw), 'utf-8')
u'\xfe\xe6\xf0\xf6'
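
The session above is Python 2. As a rough sketch, the same two steps (undo the percent-encoding, then decode the bytes) can be written so that they also run on Python 3, where the function lives in urllib.parse and unquote_to_bytes is the variant that returns bytes. The helper name decode_percent is made up for illustration and is not part of any library.

```python
try:
    # Python 2: urllib.unquote returns a byte string
    from urllib import unquote as unquote_bytes
except ImportError:
    # Python 3: unquote_to_bytes returns bytes
    from urllib.parse import unquote_to_bytes as unquote_bytes


def decode_percent(raw, encoding='utf-8'):
    """Undo percent-encoding, then decode the resulting bytes to text.

    decode_percent is a hypothetical helper name used only in this sketch.
    """
    return unquote_bytes(raw).decode(encoding)


uni = decode_percent('%C3%BE%C3%A6%C3%B0%C3%B6')
print(uni == u'\xfe\xe6\xf0\xf6')  # True
```

Keeping the decode step explicit makes it obvious where a different encoding would be plugged in if the form was not submitted as UTF-8.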

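The underlying issue is a misunderstanding of what hex escapes are. The repr of a non-printable character may be shown as a hex escape: a single backslash, followed by an 'x', followed by two hex digits. That is also how you type such a character in a string literal, but the result is still a single character. Your replace line does not turn the original string into hex escapes; it just replaces each '%' with a literal backslash character followed by an 'x'. So unicode(raw_uni, 'utf-8') is not using the repr of the string: the string really does contain those backslashes, and the repr shows each one doubled.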

Consider the following examples:

>>> len('\xC3')         # this is a hex escape, only one character
1
>>> len(r'\xC3')        # this is four characters, '\', 'x', 'C', '3'
4
>>> r'\xC3' == '\\xC3'  # in a raw string a backslash stays a literal backslash
True
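
To tie this back to the string in the question, here is a small sketch (it should run on Python 2.7 and 3) that counts characters before and after the replace, and then builds the eight real bytes from the hex digits. The variable names replaced, hex_pairs and data are only for illustration.

```python
raw = '%C3%BE%C3%A6%C3%B0%C3%B6'

# Plain text substitution: every '%' becomes the two characters '\' and 'x'.
replaced = raw.replace('%', r'\x')

print(len(raw))                 # 24: eight groups of '%' plus two hex digits
print(len(replaced))            # 32: eight groups of '\', 'x' plus two hex digits
print(replaced[:4] == '\\xC3')  # True: four separate characters, not one byte

# Turning the hex digits into actual byte values is a separate step.
hex_pairs = raw.split('%')[1:]  # the text before the first '%' is empty
data = bytearray(int(pair, 16) for pair in hex_pairs)
print(len(data))                # 8: one byte per percent-escape
```

Escape sequences are interpreted only when Python parses a string literal in source code, which is why building the text '\xC3' at runtime never produces the byte 0xC3.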

If for some reason you can't use urllib.unquote, the following should work:

import re
raw_uni = re.sub(r'%([0-9A-Fa-f]{2})', lambda m: chr(int(m.group(1), 16)), raw)
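
The one-liner relies on Python 2's chr returning a single byte. A slightly longer sketch of the same idea, written so that it should also run on Python 3, collects the bytes in a bytearray and decodes once at the end. The name percent_decode is hypothetical, and the sketch assumes the input is plain ASCII, as percent-encoded data normally is.

```python
import re

# Matches '%' followed by exactly two hex digits.
_ESCAPE = re.compile(r'%([0-9A-Fa-f]{2})')


def percent_decode(text, encoding='utf-8'):
    """Hand-rolled percent-decoding; percent_decode is an illustrative name.

    Assumes `text` contains only ASCII characters.
    """
    out = bytearray()
    pos = 0
    for match in _ESCAPE.finditer(text):
        # Copy the literal text before this escape unchanged.
        out.extend(text[pos:match.start()].encode('ascii'))
        # Append the byte that the two hex digits stand for.
        out.append(int(match.group(1), 16))
        pos = match.end()
    # Copy whatever follows the last escape.
    out.extend(text[pos:].encode('ascii'))
    return bytes(out).decode(encoding)


print(percent_decode('%C3%BE%C3%A6%C3%B0%C3%B6') == u'\xfe\xe6\xf0\xf6')  # True
```

A '%' that is not followed by two hex digits is left as it is, and '+' is not turned into a space; for form data, unquote_plus from the standard library handles that case.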
