Why doesn't this conversion to utf8 work?

Posted 2024-12-01 01:02:27

I have a subprocess command that outputs some characters such as '\xf1'. I'm trying to decode it as utf8 but I get an error.

s = '\xf1'
s.decode('utf-8')

The above throws:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xf1 in position 0: unexpected end of data

It works when I use 'latin-1' but shouldn't utf8 work as well? My understanding is that latin1 is a subset of utf8.

Am I missing something here?

EDIT:

print s # ñ
repr(s) # returns "'\\xa9'"

Comments (5)

小矜持 2024-12-08 01:02:27

You have confused Unicode with UTF-8. Latin-1 is a subset of Unicode, but it is not a subset of UTF-8. Avoid like the plague ever thinking about individual code units. Just use code points. Do not think about UTF-8. Think about Unicode instead. This is where you are being confused.

Source Code for Demo Program

Using Unicode in Python is very easy. It's especially easy with Python 3 and wide builds, the only way I use Python, but you can still use the legacy Python 2 under a narrow build if you are careful about sticking to UTF-8.

To do this, always set your source code encoding and your output encoding correctly to UTF-8. Now stop thinking of UTF-anything and use only UTF-8 literals, logical code point numbers, or symbolic character names throughout your Python program.

Here’s the source code with line numbers:

% cat -n /tmp/py
     1  #!/usr/bin/env python3.2
     2  # -*- coding: UTF-8 -*-
     3  
     4  from __future__ import unicode_literals
     5  from __future__ import print_function
     6  
     7  import sys
     8  import os
     9  import re
    10  
    11  if not (("PYTHONIOENCODING" in os.environ)
    12              and
    13          re.search("^utf-?8$", os.environ["PYTHONIOENCODING"], re.I)):
    14      sys.stderr.write(sys.argv[0] + ": Please set your PYTHONIOENCODING envariable to utf8\n")
    15      sys.exit(1)
    16  
    17  print('1a: el ni\xF1o')
    18  print('2a: el nin\u0303o')
    19  
    20  print('1b: el niño')
    21  print('2b: el niño')
    22  
    23  print('1c: el ni\N{LATIN SMALL LETTER N WITH TILDE}o')
    24  print('2c: el nin\N{COMBINING TILDE}o')

And here are print functions with their non-ASCII characters uniquoted using the \x{⋯} notation:

% grep -n ^print /tmp/py | uniquote -x
17:print('1a: el ni\xF1o')
18:print('2a: el nin\u0303o')
20:print('1b: el ni\x{F1}o')
21:print('2b: el nin\x{303}o')
23:print('1c: el ni\N{LATIN SMALL LETTER N WITH TILDE}o')
24:print('2c: el nin\N{COMBINING TILDE}o')

Sample Runs of Demo Program

Here’s a sample run of that program that shows the three different ways (a, b, and c) of doing it: the first set as literals in your source code (which will be subject to StackOverflow’s NFC conversions and so cannot be trusted!!!) and the second two sets with numeric Unicode code points and with symbolic Unicode character names respectively, again uniquoted so you can see what things really are:

% python /tmp/py
1a: el niño
2a: el niño
1b: el niño
2b: el niño
1c: el niño
2c: el niño

% python /tmp/py | uniquote -x
1a: el ni\x{F1}o
2a: el nin\x{303}o
1b: el ni\x{F1}o
2b: el nin\x{303}o
1c: el ni\x{F1}o
2c: el nin\x{303}o

% python /tmp/py | uniquote -v
1a: el ni\N{LATIN SMALL LETTER N WITH TILDE}o
2a: el nin\N{COMBINING TILDE}o
1b: el ni\N{LATIN SMALL LETTER N WITH TILDE}o
2b: el nin\N{COMBINING TILDE}o
1c: el ni\N{LATIN SMALL LETTER N WITH TILDE}o
2c: el nin\N{COMBINING TILDE}o

I really dislike looking at binary, but here is what that looks like as binary bytes:

% python /tmp/py | uniquote -b
1a: el ni\xC3\xB1o
2a: el nin\xCC\x83o
1b: el ni\xC3\xB1o
2b: el nin\xCC\x83o
1c: el ni\xC3\xB1o
2c: el nin\xCC\x83o
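
The two spellings printed above (precomposed ñ versus n followed by a combining tilde) can also be compared directly in Python 3 with the standard unicodedata module. This is only a small sketch; the variable names are illustrative and are not part of the demo program.

```python
import unicodedata

precomposed = "ni\N{LATIN SMALL LETTER N WITH TILDE}o"   # contains U+00F1
decomposed = "nin\N{COMBINING TILDE}o"                    # contains U+006E U+0303

# The strings render the same but hold different code points.
print(len(precomposed), len(decomposed))    # 4 5
print(precomposed == decomposed)            # False

# Normalizing to NFC composes n + combining tilde into U+00F1.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True

# The UTF-8 bytes only appear once the text is encoded.
print(precomposed.encode("utf-8"))   # b'ni\xc3\xb1o'
print(decomposed.encode("utf-8"))    # b'nin\xcc\x83o'
```

The byte values match the uniquote -b output above, which is one way to check which spelling a given string really contains.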

The Moral of the Story

Even when you use UTF-8 source, you should think and use only logical Unicode code point numbers (or symbolic named characters), not the individual 8-bit code units that underlie the serial representation of UTF-8 (or for that matter of UTF-16). It is extremely rare to need code units instead of code points, and it just confuses you.

You will also get more reliable behavior if you use a wide build of Python3 than you will get with alternatives to those choices, but that is a UTF-32 matter, not a UTF-8 one. Both UTF-32 and UTF-8 are easy to work with, if you just go with the flow.

耶耶耶 2024-12-08 01:02:27

UTF-8 is not a subset of Latin-1. UTF-8 encodes ASCII with the same single bytes. For all other code points, it's all multiple bytes.

Put simply, \xf1 is not valid UTF-8, as Python tells you. "unexpected end of data" indicates that this byte marks the beginning of a multi-byte sequence whose remaining bytes were not provided.

I recommend you read up on UTF-8.
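
A short Python 3 sketch of the point above. In Python 3 bytes and text are separate types, so the question's s is written here as a bytes literal; the name raw is only illustrative and stands in for the subprocess output.

```python
raw = b"\xf1"  # the single byte from the question

# Latin-1 maps every byte 0x00-0xFF to the code point with the same number.
print(raw.decode("latin-1") == "\u00f1")   # True

# In UTF-8, 0xF1 only announces a multi-byte sequence, so alone it is invalid.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 failed:", exc.reason)

# The UTF-8 encoding of U+00F1 is two bytes, not one.
print("\u00f1".encode("utf-8"))                        # b'\xc3\xb1'
print(b"\xc3\xb1".decode("utf-8") == "\u00f1")         # True
```

If the subprocess really emits Latin-1 (or Windows-1252), decoding with that codec is the right fix; if it is supposed to emit UTF-8, a lone 0xF1 suggests the output was truncated or mislabeled.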

一桥轻雨一伞开 2024-12-08 01:02:27

The easy way (Python 3):

s='\xf1'
bytes(s, 'utf-8').decode('utf-8')
#'ñ'

If you are trying to decode escaped Unicode, you can use:

s='Autom\\u00e1tico'
bytes(s, "utf-8").decode('unicode-escape')
#'Automático'

↘紸啶 2024-12-08 01:02:27

It's the first byte of a multi-byte sequence in UTF-8, so it's not valid by itself.

In fact, it's the first byte of a 4-byte sequence.

Bits Last code point Byte 1   Byte 2   Byte 3   Byte 4   Byte 5   Byte 6
21   U+1FFFFF        11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

See here for more info.
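
As a rough illustration, the bit patterns in the table can be turned into a small Python 3 check. The function name utf8_sequence_length is hypothetical, and the ranges follow the modern RFC 3629 limits (lead bytes 0xC2 to 0xF4), which are slightly narrower than the full 21-bit pattern shown in the table.

```python
def utf8_sequence_length(lead_byte):
    """Return the sequence length announced by a UTF-8 lead byte, or None."""
    if lead_byte < 0x80:            # 0xxxxxxx: ASCII, a complete character
        return 1
    if 0xC2 <= lead_byte <= 0xDF:   # 110xxxxx: 2-byte sequence
        return 2
    if 0xE0 <= lead_byte <= 0xEF:   # 1110xxxx: 3-byte sequence
        return 3
    if 0xF0 <= lead_byte <= 0xF4:   # 11110xxx: 4-byte sequence
        return 4
    return None                     # continuation byte, or never a valid lead

print(format(0xF1, "08b"))          # 11110001
print(utf8_sequence_length(0xF1))   # 4

# 0xF1 followed by three continuation bytes decodes to a single code point.
print(hex(ord(b"\xf1\x80\x80\x80".decode("utf-8"))))   # 0x40000
```

So the decoder in the question saw a byte promising three more bytes and then ran out of input.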

江城子 2024-12-08 01:02:27

My understanding is that latin1 is a subset of utf8.

Wrong. Latin-1, aka ISO 8859-1 (and sometimes erroneously called Windows-1252), is not a subset of UTF-8. ASCII, on the other hand, is a subset of UTF-8. ASCII strings are valid UTF-8 strings, but generalized Windows-1252 or ISO 8859-1 strings are not valid UTF-8, which is why s.decode('UTF-8') is throwing a UnicodeDecodeError.
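
The subset claim can be checked byte by byte in Python 3. This is a minimal sketch; is_valid_utf8 is a hypothetical helper written for the example, not a standard library function.

```python
# Every ASCII byte (0x00-0x7F) decodes to the same character in both encodings.
ascii_same = all(
    bytes([b]).decode("utf-8") == bytes([b]).decode("latin-1")
    for b in range(0x80)
)
print(ascii_same)   # True

def is_valid_utf8(data):
    """Return True if data decodes as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# No byte in 0x80-0xFF is valid UTF-8 on its own, yet all are valid Latin-1.
lone_high_valid = [b for b in range(0x80, 0x100) if is_valid_utf8(bytes([b]))]
print(lone_high_valid)   # []
```

In other words, the two encodings agree only on the ASCII range, which is why pure-ASCII output never reveals the mismatch.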
