Python UTF-16 WAVY DASH encoding problem

Posted on 2024-08-21 20:05:09

I was doing some work today and came across an issue where something "looked funny". I had been interpreting some string data as UTF-8 and checking the encoded form. The data was coming from LDAP (specifically, Active Directory) via python-ldap. No surprises there.

A few times I came upon the byte sequence '\xe3\x80\xb0', which, when decoded as UTF-8, is Unicode code point U+3030 (WAVY DASH). I need the string data in UTF-16, so naturally I converted it via .encode('utf-16'). Unfortunately, it seems Python doesn't like this character:

D:\> python
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\u3030"
u'\u3030'
>>> u"\u3030".encode("utf-8")
'\xe3\x80\xb0'
>>> u"\u3030".encode("utf-16-le")
'00'
>>> u"\u3030".encode("utf-16-be")
'00'
>>> '\xe3\x80\xb0'.decode('utf-8')
u'\u3030'
>>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16')
'\xff\xfe00'
>>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-8')
u'00'

It seems IronPython isn't a fan either:

D:\ipy
IronPython 2.6 Beta 2 (2.6.0.20) on .NET 2.0.50727.3053
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\u3030"
u'\u3030'
>>> u"\u3030".encode('utf-8')
u'\xe3\x80\xb0'
>>> u"\u3030".encode('utf-16-le')
'00'

If somebody could tell me what, exactly, is going on here, it'd be much appreciated.

自演自醉 2024-08-28 20:05:09

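That appears to be the correct behaviour. The character u'\u3030' encodes in UTF-16 to the two bytes 0x30 0x30, which is exactly how '00' is encoded in UTF-8 (and ASCII). It looks strange, but it's correct.

The '\xff\xfe' you can see is just a byte order mark.

Are you sure you want a wavy dash, and not some other character? If you were expecting a different character, the data may already have been mis-encoded before it reached your application.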


零崎曲识 2024-08-28 20:05:09

But it decodes okay:

>>> u"\u3030".encode("utf-16-le")
'00'
>>> '00'.decode("utf-16-le")
u'\u3030'

The UTF-16 encoding of that character is the two bytes 0x30 0x30, and 0x30 happens to be the ASCII code for '0', so the result is displayed as '00'. You could also write it as '\x30\x30':

>>> '00' == '\x30\x30'
True
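
To make the coincidence visible, here is a minimal sketch that prints the raw bytes in hexadecimal. It assumes Python 3 (the sessions above are Python 2), where string literals are already Unicode and `bytes.hex()` is available; the variable names are only illustrative.

```python
# Python 3 sketch: inspect the raw bytes behind the '00' output.
wavy_dash = "\u3030"  # U+3030 WAVY DASH

utf8_bytes = wavy_dash.encode("utf-8")
utf16le_bytes = wavy_dash.encode("utf-16-le")
utf16be_bytes = wavy_dash.encode("utf-16-be")

print(utf8_bytes.hex())     # e380b0
print(utf16le_bytes.hex())  # 3030
print(utf16be_bytes.hex())  # 3030

# 0x30 is the ASCII code for '0', so the two-byte UTF-16 form
# is displayed as two digit zeros.
print(utf16le_bytes == b"00")  # True

# Both bytes are identical, so little-endian and big-endian
# output look the same for this particular character.
print(utf16le_bytes == utf16be_bytes)  # True
```

Looking at the hex form rather than the repr avoids the confusion, because printable bytes are no longer shown as ASCII characters.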


无言温柔 2024-08-28 20:05:09

You are being confused by two things here (they threw me off too):

The utf-16 and utf-32 encodings prepend a BOM unless you specify the byte order explicitly, via utf-16-le, utf-16-be and such. That is the \xff\xfe in the second-to-last result.
'00' is two digit-zero characters (bytes 0x30 0x30). It is not a pair of null characters. Those would print differently anyway:
```
>>> '\0\0'
'\x00\x00'
```
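
Both points can be checked directly. The following is a small sketch in Python 3 syntax (an assumption; the thread itself uses Python 2). It relies only on the standard `codecs` module constants `BOM_UTF16_LE` and `BOM_UTF16_BE`, and it does not hard-code which BOM appears, since plain "utf-16" uses the platform's native byte order.

```python
import codecs

wavy_dash = "\u3030"  # U+3030 WAVY DASH

with_bom = wavy_dash.encode("utf-16")        # native byte order, BOM prepended
without_bom = wavy_dash.encode("utf-16-le")  # explicit byte order, no BOM

# The first two bytes are the byte order mark.
bom = with_bom[:2]
print(bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True

# What follows the BOM is the character itself. The comparison holds on
# any platform here because both bytes of U+3030 are 0x30.
print(with_bom[2:] == without_bom)  # True

# Decoding with "utf-16" consumes the BOM again.
print(with_bom.decode("utf-16") == wavy_dash)  # True

# Digit zeros and null bytes have different reprs.
print(repr(b"00"))        # b'00'
print(repr(b"\x00\x00"))  # b'\x00\x00'
```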


安静被遗忘 2024-08-28 20:05:09

There is a basic error in your sample code above. Remember, you encode Unicode to an encoded string, and you decode from an encoded string back to Unicode. So, you do:

'\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-8')

which translates to the following steps:

'\xe3\x80\xb0' # (some string)
.decode('utf-8') # decode above text as UTF-8 encoded text, giving u'\u3030'
.encode('utf-16-le') # encode u'\u3030' as UTF-16-LE, i.e. '00'
.decode('utf-8') # OOPS! decode using the wrong encoding here!

u'\u3030' is indeed encoded as '00' (ASCII digit zero, twice) in UTF-16LE, but you seem to have read that as a null byte ('\0') or something similar.

Remember, you won't get the same character back if you encode with one encoding and decode with another:

>>> import unicodedata as ud
>>> c= unichr(193)
>>> ud.name(c)
'LATIN CAPITAL LETTER A WITH ACUTE'
>>> ud.name(c.encode("cp1252").decode("cp1253"))
'GREEK CAPITAL LETTER ALPHA'

In this code, I encoded to Windows-1252 and decoded from Windows-1253. In your code, you encoded to UTF-16LE and decoded from UTF-8.
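
The same check applied to the wavy dash might look like the sketch below. It is written for Python 3 (an assumption, since the code above is Python 2), and the names `right` and `wrong` are only illustrative.

```python
import unicodedata

wavy_dash = "\u3030"  # U+3030 WAVY DASH
encoded = wavy_dash.encode("utf-16-le")  # the bytes 0x30 0x30

# Decoding with the codec that produced the bytes gives the character back.
right = encoded.decode("utf-16-le")
print(unicodedata.name(right))  # WAVY DASH

# Decoding with a different codec does not fail here, because 0x30 0x30
# happens to be valid UTF-8. It silently yields two digit zeros instead.
wrong = encoded.decode("utf-8")
print([unicodedata.name(ch) for ch in wrong])  # ['DIGIT ZERO', 'DIGIT ZERO']

print(right == wavy_dash)  # True
print(wrong == wavy_dash)  # False
```

A mismatch like this is easy to miss because no exception is raised; the bytes are simply valid in both encodings.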
