Python UTF-16 WAVY DASH encoding issue
I was doing some work today and came across an issue where something "looked funny". I had been interpreting some string data as UTF-8 and checking the encoded form. The data was coming from LDAP (specifically, Active Directory) via python-ldap. No surprises there.
So I came upon the byte sequence '\xe3\x80\xb0' a few times, which, when decoded as UTF-8, is Unicode code point U+3030 (WAVY DASH). I need the string data in UTF-16, so naturally I converted it via .encode('utf-16'). Unfortunately, it seems Python doesn't like this character:
D:\> python
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\u3030"
u'\u3030'
>>> u"\u3030".encode("utf-8")
'\xe3\x80\xb0'
>>> u"\u3030".encode("utf-16-le")
'00'
>>> u"\u3030".encode("utf-16-be")
'00'
>>> '\xe3\x80\xb0'.decode('utf-8')
u'\u3030'
>>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16')
'\xff\xfe00'
>>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-8')
u'00'
It seems IronPython isn't a fan either:
D:\> ipy
IronPython 2.6 Beta 2 (2.6.0.20) on .NET 2.0.50727.3053
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\u3030"
u'\u3030'
>>> u"\u3030".encode('utf-8')
u'\xe3\x80\xb0'
>>> u"\u3030".encode('utf-16-le')
'00'
If somebody could tell me what, exactly, is going on here, it'd be much appreciated.
Comments (4)
This seems to be the correct behaviour. In UTF-16 the character u'\u3030' is the single 16-bit code unit 0x3030, so its two bytes are 0x30 0x30, which is the same byte sequence as '00' encoded in UTF-8 (or ASCII). It looks strange, but it's correct.
The '\xff\xfe' you can see is just a byte order mark (BOM).
Are you sure you want a wavy dash, and not some other character? If you were hoping for a different character then it might be because it had already been misencoded before entering your application.
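The point about the byte values and the byte order mark can be checked directly. This is a minimal sketch, assuming Python 3 syntax (where `str` is Unicode text and `encode()` returns `bytes`) rather than the Python 2.6 session in the question; the variable names are illustrative.

```python
import codecs

# Python 3: a str literal is already Unicode text.
wavy_dash = "\u3030"

# Explicit byte orders: no BOM is written.
utf16_le = wavy_dash.encode("utf-16-le")
utf16_be = wavy_dash.encode("utf-16-be")

# The code unit 0x3030 has two identical bytes, so both byte orders
# produce the same result, and 0x30 is also the ASCII code for "0".
print(utf16_le.hex(), utf16_be.hex())
print(utf16_le == b"00")

# The generic "utf-16" codec prepends a BOM in the platform's byte order.
with_bom = wavy_dash.encode("utf-16")
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(with_bom[2:] == b"00")
```

On a little-endian machine the BOM is `\xff\xfe`, which matches the `'\xff\xfe00'` shown in the question.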
But it decodes okay, as long as you decode with the same codec you encoded with.
It's just that each byte of the UTF-16 encoding of that character happens to coincide with the ASCII code for '0' (0x30). You could also write the same bytes as '\x30\x30'.
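A short sketch of that round trip, again assuming Python 3 syntax instead of the Python 2.6 used in the question; the names `encoded`, `round_trip`, and `same_bytes` are made up for illustration.

```python
# Encode the wavy dash as UTF-16 little-endian, then decode with the same codec.
encoded = "\u3030".encode("utf-16-le")
round_trip = encoded.decode("utf-16-le")
print(round_trip == "\u3030")

# b"\x30\x30" and b"00" are two spellings of the same two bytes.
same_bytes = b"\x30\x30"
print(same_bytes == b"00" == encoded)

# Decoding those bytes as UTF-16 gives the wavy dash back.
print(same_bytes.decode("utf-16-le") == "\u3030")

# 0x30 is the ASCII code of the digit zero.
print(ord("0") == 0x30)
```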
You are being confused by two things here (threw me off too):
'00' is two digit-zero characters. It is not a null character. That'd print differently anyway:
>>> '\0\0'
'\x00\x00'
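The difference between the digit zero and the null byte is easy to see from the byte values. This sketch assumes Python 3 syntax, where iterating over `bytes` yields integers; the variable names are illustrative.

```python
# Two ASCII digit-zero characters versus two NUL bytes.
two_zero_digits = b"00"
two_nul_bytes = b"\0\0"

# Byte values: the digit zero is 48 (0x30), NUL is 0.
print(list(two_zero_digits))
print(list(two_nul_bytes))

# The repr of NUL bytes uses escapes, so it never looks like '00'.
print(repr(two_nul_bytes))

# Decoded as UTF-16, two NUL bytes are the single character U+0000,
# not the wavy dash.
print(two_nul_bytes.decode("utf-16-le") == "\u0000")
```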
There is a basic error in your sample code above. Remember, you encode Unicode to an encoded string, and you decode from an encoded string back to Unicode. In the last line of your session you decode UTF-8 bytes to u'\u3030', encode that as UTF-16LE, and then decode the result as UTF-8 again:
>>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-8')
u'\u3030' is indeed encoded as '00' (ASCII zero twice) in UTF-16LE, but you seem to think that this is a null byte ('\0') or something similar.
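Broken into separate steps, the chained call from the question looks roughly like this. It is a sketch in Python 3 syntax (the question uses Python 2.6), and the intermediate variable names are assumptions added for readability.

```python
# Step 0: the raw UTF-8 bytes as they arrive from LDAP.
raw = b"\xe3\x80\xb0"

# Step 1: decode UTF-8 bytes to Unicode text.
text = raw.decode("utf-8")

# Step 2: encode the text as UTF-16 little-endian bytes.
utf16 = text.encode("utf-16-le")

# Step 3 (the mistake): decode UTF-16 bytes with the UTF-8 codec.
# The bytes 0x30 0x30 are valid UTF-8, so this silently yields "00".
wrong = utf16.decode("utf-8")

# Step 3 (corrected): decode with the codec that produced the bytes.
right = utf16.decode("utf-16-le")

print(repr(text), utf16, repr(wrong), repr(right))
```

No exception is raised in the mistaken step because the two bytes happen to be valid UTF-8; with most other characters the mismatch would produce a decoding error or garbage instead.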
Remember, you can't get back to the same character if you encode with one encoding and decode with another, for example by encoding to Windows-1252 and decoding from Windows-1253. In your code, you encoded to UTF-16LE and decoded from UTF-8.
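A mismatch of that kind can be sketched as follows. The sample character "é" and the variable names are assumptions chosen for illustration, and the snippet uses Python 3 syntax; `cp1252` and `cp1253` are Python's codec names for Windows-1252 and Windows-1253.

```python
# A character that exists in Windows-1252 (Western European).
original = "\u00e9"  # é, LATIN SMALL LETTER E WITH ACUTE

# Encode with one codec...
encoded = original.encode("cp1252")

# ...and decode with a different one (Windows-1253, Greek).
mismatched = encoded.decode("cp1253")

# Decoding with the matching codec restores the original character.
matched = encoded.decode("cp1252")

print(repr(original), encoded, repr(mismatched), repr(matched))
print(mismatched == original)  # False: the same byte maps to a Greek letter
print(matched == original)     # True
```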