Is there any harm in encoding a string multiple times with the same encoding? (Python)

Posted 2024-10-07 08:04:33

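Is there any harm in encoding a string multiple times with the same encoding (i.e. UTF-8) in Python?

I have a function that uses another function to get a string from a document and then serializes that string. At the moment, the only user of the second function (the one that gets the string from the document) is the first function.

That may change in the future: someone may decide to use it from another serialization (or similar) function without first encoding its result as UTF-8. I would like to know whether it is safe to always return a UTF-8 encoded string from it (at present, that string would then be .encode()d again by the serialization function). My tests suggest this is not a problem, but I thought I should ask.

Thank you!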


两个我 2024-10-14 08:04:34

You can't encode multiple times; it doesn't work.

>>> s = u"ä".encode('latin1')
>>> s = s.encode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

See, you get "'ascii' codec can't decode". In Python 2, what the encode method on a byte string does is first decode the string to Unicode and then encode it again with the given encoding. It decodes with the system default encoding, which is ASCII by default.

That behavior is unexpected and, by the way, gone in Python 3, where bytes objects don't have an encode method and strings don't have a decode method.

So you simply can't encode it multiple times, and of course that's because encoding an already encoded string doesn't make any sense. Encoding converts Unicode to a binary representation, and you can't further encode a binary representation.
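
For comparison, here is a minimal Python 3 sketch of the same experiment. The variable names are illustrative and not from the original post. It spells out the hidden ASCII decode step explicitly and shows that a second encode is rejected outright, because bytes has no encode method.

```python
# Python 3: text (str) and binary data (bytes) are separate types.
text = "\u00e4"                      # the a-umlaut character from the example above
encoded = text.encode("latin1")      # str -> bytes
assert encoded == b"\xe4"

# The step Python 2 performed implicitly: decoding the bytes as ASCII.
try:
    encoded.decode("ascii")
    implicit_step_error = None
except UnicodeDecodeError as exc:
    implicit_step_error = exc

# bytes has no encode() method, so a second encode cannot even be attempted.
try:
    encoded.encode("latin1")
    second_encode_error = None
except AttributeError as exc:
    second_encode_error = exc

# Likewise, str has no decode() method.
str_has_decode = hasattr(text, "decode")

# The only valid next step for bytes is decoding back to text.
round_trip = encoded.decode("latin1")
```

In this sketch the mistake surfaces as an AttributeError at the point of the second encode, rather than as a decode error that depends on the data.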


瞎闹 2024-10-14 08:04:34

Yes, it can cause harm unless the string is pure ASCII (and if it is pure ASCII, you don't need to worry about UTF-8):

>>> a
u'a \xd7 b'
>>> a.encode("utf-8")
'a \xc3\x97 b'
>>> a.encode("utf-8").encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

It's good practice to treat byte sequences and text as two different things. In Python 3, they are different things: bytes objects have a decode() method, and string (unicode) objects have an encode() method.
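
One way to apply this to the setup in the question, as a rough Python 3 sketch: keep the function that reads from the document returning text, and encode exactly once at the serialization boundary. The names get_text, serialize, and ensure_bytes, and the dict standing in for a document, are hypothetical and only illustrate the idea.

```python
def get_text(document: dict) -> str:
    """Hypothetical stand-in for the function that reads a string from a document."""
    return document["body"]  # always returns text (str), never bytes


def serialize(document: dict) -> bytes:
    """Hypothetical serializer: the single place where text becomes bytes."""
    return get_text(document).encode("utf-8")


def ensure_bytes(value, encoding="utf-8") -> bytes:
    """Defensive helper: encode text, pass existing bytes through unchanged."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode(encoding)
    raise TypeError(f"expected str or bytes, got {type(value).__name__}")


doc = {"body": "a \u00d7 b"}           # same text as the example above
payload = serialize(doc)               # encoded once, at the boundary
same_payload = ensure_bytes(payload)   # already bytes, returned unchanged
from_text = ensure_bytes(get_text(doc))
```

With this split, a future caller of get_text always receives text and decides for itself when to encode. A helper like ensure_bytes is only worth having if a caller may genuinely receive either type.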
