Why doesn't this conversion to utf8 work?
I have a subprocess command that outputs some characters such as '\xf1'. I'm trying to decode it as utf8 but I get an error.
s = '\xf1'
s.decode('utf-8')
The above throws:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf1 in position 0: unexpected end of data
It works when I use 'latin-1' but shouldn't utf8 work as well? My understanding is that latin1 is a subset of utf8.
Am I missing something here?
EDIT:
print s # ñ
repr(s) # returns "'\\xf1'"
You have confused Unicode with UTF-8. Latin-1 is a subset of Unicode, but it is not a subset of UTF-8. Avoid like the plague ever thinking about individual code units. Just use code points. Do not think about UTF-8. Think about Unicode instead. This is where you are being confused.
Source Code for Demo Program
Using Unicode in Python is very easy. It's especially easy with Python 3 and wide builds, the only way I use Python, but you can still use the legacy Python 2 under a narrow build if you are careful about sticking to UTF-8.
To do this, always set your source code encoding and your output encoding correctly to UTF-8. Now stop thinking of UTF-anything and use only UTF-8 literals, logical code point numbers, or symbolic character names throughout your Python program.
Here’s the source code with line numbers:
And here are print functions with their non-ASCII characters uniquoted using the \x{⋯} notation:
Sample Runs of Demo Program
Here’s a sample run of that program that shows the three different ways (a, b, and c) of doing it: the first set as literals in your source code (which will be subject to StackOverflow’s NFC conversions and so cannot be trusted!!!) and the second two sets with numeric Unicode code points and with symbolic Unicode character names respectively, again uniquoted so you can see what things really are:
I really dislike looking at binary, but here is what that looks like as binary bytes:
The Moral of the Story
Even when you use UTF-8 source, you should think and use only logical Unicode code point numbers (or symbolic named characters), not the individual 8-bit code units that underlie the serial representation of UTF-8 (or for that matter of UTF-16). It is extremely rare to need code units instead of code points, and it just confuses you.
You will also get more reliable behavior if you use a wide build of Python 3 than you will get with alternatives to those choices, but that is a UTF-32 matter, not a UTF-8 one. Both UTF-32 and UTF-8 are easy to work with, if you just go with the flow.
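As a rough illustration of the code point versus code unit distinction made in this answer, here is a minimal Python 3 sketch. It is not the demo program referred to above; the variable names are made up for the example. It builds the same character from a numeric code point and from a symbolic name, then shows that only the encoded bytes differ between encodings.

```python
import unicodedata

# The same character written three ways that do not depend on source encoding.
by_number = "\u00f1"                               # numeric code point escape
by_name = "\N{LATIN SMALL LETTER N WITH TILDE}"    # symbolic character name
by_chr = chr(0xF1)                                 # code point as an integer

same = by_number == by_name == by_chr
print(same)                          # True
print(len(by_number))                # 1 (one code point)
print(unicodedata.name(by_number))   # LATIN SMALL LETTER N WITH TILDE

# The code units only appear once the string is encoded, and they
# depend entirely on which encoding is chosen.
encoded = {
    "latin-1": by_number.encode("latin-1"),
    "utf-8": by_number.encode("utf-8"),
    "utf-16-le": by_number.encode("utf-16-le"),
    "utf-32-le": by_number.encode("utf-32-le"),
}
for name, data in encoded.items():
    print(name, data)
```

The code point is always U+00F1; the byte 0xF1 on its own is just how Latin-1 happens to serialize it.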
Latin-1 is not a subset of UTF-8. UTF-8 encodes ASCII with the same single bytes; every other code point takes multiple bytes.
Put simply, \xf1 on its own is not valid UTF-8, as Python tells you. "Unexpected end of data" indicates that this byte marks the beginning of a multi-byte sequence whose remaining bytes are not provided.
I recommend you read up on UTF-8.
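A small Python 3 sketch of the point above, assuming the subprocess output is the single byte from the question (in Python 3 that is a bytes object, so it is written as b"\xf1" here):

```python
raw = b"\xf1"  # the lone byte from the question

# Latin-1 maps each byte 0x00-0xFF to the code point with the same number.
text = raw.decode("latin-1")
print(hex(ord(text)))     # 0xf1

# UTF-8 represents that same code point with two bytes, not one.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)         # b'\xc3\xb1'

# Decoding the lone byte as UTF-8 fails because the rest of the
# multi-byte sequence is missing.
try:
    raw.decode("utf-8")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print(exc.reason)
```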
The easy way (Python 3)
If you are trying to decode escaped Unicode, you can use:
It's the first byte of a multi-byte sequence in UTF-8, so it's not valid by itself.
In fact, it's the first byte of a 4 byte sequence.
See here for more info.
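To make the "first byte of a 4 byte sequence" claim concrete, here is a minimal Python 3 sketch. The helper utf8_sequence_length is a hypothetical name written for this example, not a standard library function; it classifies a lead byte by the ranges UTF-8 allows.

```python
def utf8_sequence_length(lead_byte):
    """Return the length of the UTF-8 sequence that lead_byte starts,
    or None if the byte cannot start a sequence."""
    if lead_byte < 0x80:
        return 1            # ASCII, encoded as itself
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    # 0x80-0xBF are continuation bytes; 0xC0, 0xC1 and 0xF5-0xFF
    # never appear in valid UTF-8.
    return None


print(utf8_sequence_length(0xF1))   # 4

# With three continuation bytes supplied, the sequence decodes.
completed = b"\xf1\x80\x80\x80".decode("utf-8")
print(hex(ord(completed)))          # 0x40000
```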
Wrong. Latin-1, aka ISO 8859-1 (and sometimes erroneously called Windows-1252), is not a subset of UTF-8. ASCII, on the other hand, is a subset of UTF-8. ASCII strings are valid UTF-8 strings, but generalized Windows-1252 or ISO 8859-1 strings are not valid UTF-8, which is why s.decode('UTF-8') is throwing a UnicodeDecodeError.
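If the encoding of the subprocess output is not known in advance, one possible workaround is to try UTF-8 first and fall back to Latin-1. This is only a sketch in Python 3: decode_output is a hypothetical helper, and guessing encodings is a heuristic. Knowing what the subprocess actually emits is the more reliable fix.

```python
def decode_output(raw, encodings=("utf-8", "latin-1")):
    """Decode raw bytes with the first encoding that accepts them."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


print(decode_output(b"abc"))                 # abc
print(hex(ord(decode_output(b"\xf1"))))      # 0xf1 (fell back to Latin-1)
print(hex(ord(decode_output(b"\xc3\xb1"))))  # 0xf1 (valid UTF-8)
```

Because Latin-1 accepts every possible byte, it has to come last in the list; anything placed after it would never be tried.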