Python UTF-16 WAVY DASH encoding issue

Posted 2024-08-21 20:05:09

I was doing some work today and came across something that "looked funny". I had been interpreting some string data as UTF-8 and checking the encoded form. The data was coming from LDAP (specifically, Active Directory) via python-ldap. No surprises there.

So I came upon the byte sequence '\xe3\x80\xb0' a few times, which, when decoded as UTF-8, is Unicode code point U+3030 (WAVY DASH). I need the string data in UTF-16, so naturally I converted it via .encode('utf-16'). Unfortunately, it seems Python doesn't like this character:

D:\> python
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\u3030"
u'\u3030'
>>> u"\u3030".encode("utf-8")
'\xe3\x80\xb0'
>>> u"\u3030".encode("utf-16-le")
'00'
>>> u"\u3030".encode("utf-16-be")
'00'
>>> '\xe3\x80\xb0'.decode('utf-8')
u'\u3030'
>>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16')
'\xff\xfe00'
>>> '\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-8')
u'00'

It seems IronPython isn't a fan either:

D:\ipy
IronPython 2.6 Beta 2 (2.6.0.20) on .NET 2.0.50727.3053
Type "help", "copyright", "credits" or "license" for more information.
>>> u"\u3030"
u'\u3030'
>>> u"\u3030".encode('utf-8')
u'\xe3\x80\xb0'
>>> u"\u3030".encode('utf-16-le')
'00'

If somebody could tell me what, exactly, is going on here, it'd be much appreciated.

Comments (4)

自演自醉 2024-08-28 20:05:09

This seems to be the correct behaviour. The character u'\u3030' encoded in UTF-16 is the two bytes 0x30 0x30, which is the same byte sequence as the string '00' encoded in UTF-8 (or ASCII). It looks strange, but it's correct.

The '\xff\xfe' you can see is just a Byte Order Mark.

Are you sure you want a wavy dash, and not some other character? If you were hoping for a different character then it might be because it had already been misencoded before entering your application.
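To make the BOM point concrete, here is a minimal sketch in Python 3 syntax (the sessions in the question are Python 2.6, so the literals differ: text is `"..."` and bytes are `b"..."`). The variable names are only for illustration. It assumes nothing beyond the standard `codecs` module.

```python
import codecs

wavy_dash = "\u3030"  # U+3030 WAVY DASH

# Plain "utf-16" uses the machine's native byte order and prepends a BOM.
with_bom = wavy_dash.encode("utf-16")
# Naming the byte order explicitly suppresses the BOM.
le_only = wavy_dash.encode("utf-16-le")
be_only = wavy_dash.encode("utf-16-be")

print(with_bom[:2] == codecs.BOM_UTF16)   # True: the first two bytes are the BOM
print(with_bom[2:], le_only, be_only)     # b'00' b'00' b'00'
print(le_only == "00".encode("utf-8"))    # True: same bytes as the text '00' in UTF-8
```

Because both bytes of U+3030 are 0x30, the little-endian and big-endian forms happen to be identical for this particular character.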

零崎曲识 2024-08-28 20:05:09

But it decodes okay:

>>> u"\u3030".encode("utf-16-le")
'00'
>>> '00'.decode("utf-16-le")
u'\u3030'

The UTF-16 encoding of that character just happens to consist of two bytes that are each the ASCII code for '0' (0x30). You could also represent it with '\x30\x30':

>>> '00' == '\x30\x30'
True
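Printing the bytes as hex avoids the confusion caused by the interpreter showing printable bytes as ASCII characters. A small sketch in Python 3 syntax follows; `utf16le_hex` is a hypothetical helper introduced here for illustration, not something from the thread.

```python
def utf16le_hex(ch):
    """Return the UTF-16-LE bytes of ch as a hex string (illustrative helper)."""
    return ch.encode("utf-16-le").hex()

print(utf16le_hex("\u3030"))           # 3030 -> bytes 0x30 0x30, shown as b'00'
print(utf16le_hex("0"))                # 3000 -> the digit zero itself is 0x30 0x00
print("\u3042".encode("utf-16-le"))    # b'B0' -> 0x42 0x30 also prints as ASCII
```

The last line shows that U+3030 is not special: any code point whose two bytes both fall in the printable ASCII range will be displayed as ordinary characters.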

无言温柔 2024-08-28 20:05:09

You are being confused by two things here (threw me off too):

  1. The utf-16 and utf-32 encodings prepend a BOM unless you specify which byte order to use, via utf-16-be and such. This is the \xff\xfe in the second-to-last line.
  2. '00' is two DIGIT ZERO characters (byte value 0x30 each). It is not a pair of null characters. Those would print differently anyway:

    >>> '\0\0'
    '\x00\x00'
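The same distinction can be checked numerically. This is a minimal sketch in Python 3 syntax, with variable names chosen only for illustration:

```python
digits = "\u3030".encode("utf-16-le")  # the two bytes the question saw as '00'
nulls = b"\x00\x00"                    # two actual null bytes

print(list(digits))      # [48, 48]  (0x30 is decimal 48, ASCII '0')
print(list(nulls))       # [0, 0]
print(digits == nulls)   # False
```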
    
安静被遗忘 2024-08-28 20:05:09

There is a basic error in your sample code above. Remember, you encode Unicode to an encoded string, and you decode from an encoded string back to Unicode. So, you do:

'\xe3\x80\xb0'.decode('utf-8').encode('utf-16-le').decode('utf-8')

which translates to the following steps:

'\xe3\x80\xb0' # (some string)
.decode('utf-8') # decode above text as UTF-8 encoded text, giving u'\u3030'
.encode('utf-16-le') # encode u'\u3030' as UTF-16-LE, i.e. '00'
.decode('utf-8') # OOPS! decode using the wrong encoding here!

u'\u3030' is indeed encoded as '00' (ASCII zero, twice) in UTF-16LE, but you seem to think that this is a null byte ('\0') or something.

Remember, you can't get back to the same character if you encode with one encoding and decode with another:

>>> import unicodedata as ud
>>> c= unichr(193)
>>> ud.name(c)
'LATIN CAPITAL LETTER A WITH ACUTE'
>>> ud.name(c.encode("cp1252").decode("cp1253"))
'GREEK CAPITAL LETTER ALPHA'

In this code, I encoded to Windows-1252 and decoded from Windows-1253. In your code, you encoded to UTF-16LE and decoded from UTF-8.
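For comparison, here is a sketch of the question's chain and of the cp1252/cp1253 example in Python 3 syntax, where `chr` replaces `unichr` and byte strings need a `b` prefix. The variable names are illustrative only.

```python
import unicodedata

raw = b"\xe3\x80\xb0"             # bytes as they arrived from LDAP
text = raw.decode("utf-8")        # U+3030
utf16 = text.encode("utf-16-le")  # b'00'

wrong = utf16.decode("utf-8")      # decoded with the wrong codec
right = utf16.decode("utf-16-le")  # decoded with the codec used to encode

print(unicodedata.name(text))  # WAVY DASH
print(repr(wrong))             # '00'
print(right == text)           # True

# Same mistake with two single-byte code pages.
c = chr(193)
print(unicodedata.name(c))                                    # LATIN CAPITAL LETTER A WITH ACUTE
print(unicodedata.name(c.encode("cp1252").decode("cp1253")))  # GREEK CAPITAL LETTER ALPHA
```

Decoding with the codec that produced the bytes is the only round trip that is guaranteed to return the original text.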
