Is there any harm in encoding a string multiple times with the same encoding? (Python)

Posted 2024-10-07 08:04:33

Is there any harm in encoding a string multiple times in Python with the same encoding (e.g., UTF-8)?

I have a function that uses another function to get a string from a document and then serializes that string. Currently, the only caller of the second function (the one that gets the string from the document) is the first function.

This might change in the future: someone might decide to use it in another serialization (or similar) function without first encoding its result to UTF-8. I'm wondering if it's safe to always return a UTF-8 encoded string from it (at the moment, this string would also be re-.encode()'d by the serialization function). My testing indicates this isn't a problem, but I figured I'd ask.

Thank you!

Comments (4)

两个我 2024-10-14 08:04:34

You can't encode multiple times; it doesn't work.

>>> s = u"ä".encode('latin1')
>>> s = s.encode('latin1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128)

See, you get "'ascii' codec can't decode". In Python 2, calling the encode method on a byte string first decodes the string to Unicode and then encodes it again with the given encoding. The implicit decode uses the system default encoding, which is ascii by default.

That behavior is unexpected and is gone in Python 3, by the way, where bytes doesn't have an encode method and strings don't have a decode method.

So you simply can't encode it multiple times, and of course that's because encoding an already encoded string doesn't make any sense. Encoding converts Unicode to a binary representation, and you can't further encode a binary representation.
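For comparison, here is a minimal Python 3 sketch of the same experiment. The variable names are only illustrative; the point is that the second encode is rejected outright instead of triggering a hidden ascii decode.

```python
text = "ä"                      # str: Unicode text
data = text.encode("latin-1")   # bytes: the single byte 0xe4

# In Python 3, bytes has decode() but no encode(),
# and str has encode() but no decode().
print(hasattr(data, "encode"), hasattr(text, "decode"))

try:
    data.encode("latin-1")      # the Python 2 mistake, attempted in Python 3
except AttributeError as exc:
    print("second encode rejected:", exc)

# The only meaningful next step is to decode the bytes back to text.
assert data.decode("latin-1") == text
```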

瞎闹 2024-10-14 08:04:34

Unless the string is pure ASCII, then yes, it can cause harm (and if it is pure ASCII, you don't need to worry about UTF-8):

>>> a
u'a \xd7 b'
>>> a.encode("utf-8")
'a \xc3\x97 b'
>>> a.encode("utf-8").encode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

It's good practice to treat byte sequences and text as two different things. In Python 3, they are different things: bytes objects have a decode() method, and string (Unicode) objects have an encode() method.
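If a function may receive either text or already encoded bytes, one possible way to make "encode at most once" explicit is a small type check. This is a Python 3 sketch, not something from the thread; `ensure_bytes` is a hypothetical helper name, and it assumes that any bytes passed in are already in the desired encoding.

```python
def ensure_bytes(value, encoding="utf-8"):
    """Return value as bytes, encoding it only if it is still text.

    Hypothetical helper: bytes are passed through untouched, on the
    assumption that they were already encoded with `encoding`.
    """
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode(encoding)
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


once = ensure_bytes("a \xd7 b")   # text in, UTF-8 bytes out
twice = ensure_bytes(once)        # bytes in, same bytes out
print(once, twice)
```

Calling the helper repeatedly is harmless, because only the first call performs an actual encode.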

桃扇骨 2024-10-14 08:04:34

In general, you should only call encode on unicode objects and only call decode on (byte) string objects.

encode encodes a Unicode object into a given encoding (stored as a string). decode decodes a string in a given encoding back into a Unicode object.

The existence of string.encode and unicode.decode in 2.x should be treated as a historical artifact.
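Applied to the situation in the question, one way to follow this rule is to have the inner function return text and let each serializer encode exactly once at the output boundary. The sketch below is written for Python 3 and is only an illustration: `get_string`, `serialize`, and the `"title"` field are hypothetical names, and JSON merely stands in for whatever serialization format is actually used.

```python
import json


def get_string(document):
    """Hypothetical extractor: always returns text (str), never bytes."""
    return document["title"]


def serialize(text):
    """Hypothetical serializer: the single place where encoding happens."""
    return json.dumps({"title": text}, ensure_ascii=False).encode("utf-8")


payload = serialize(get_string({"title": "Fööbär"}))
print(payload)

# A reader decodes once at its own boundary and gets the same text back.
restored = json.loads(payload.decode("utf-8"))["title"]
print(restored)
```

With this split, a future caller of `get_string` cannot double-encode by accident, because it never receives bytes in the first place.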

木格 2024-10-14 08:04:34

Well, if you have a stream of bytes that is UTF-8 encoded text, and you interpret it as a string encoded in something else and then re-encode it as UTF-8, then you have a problem.

If you read it as UTF-8 again (since you certainly cannot treat bytes as text without an encoding), then you have Unicode, which, when written as UTF-8 again, will look the same as before.

Just be careful not to mess around with the encodings too much. A common error is to interpret UTF-8 encoded text as Latin 1, thereby turning Fööbär into FÃ¶Ã¶bÃ¤r, which then of course stays garbled when it is written out again.

Note the difference between text (the actual thing you care about) and encoded text, which is just a bunch of bytes plus the knowledge of how to turn them into text again. If you treat the latter as the former, problems arise. If you convert properly from one representation to the other, it's fine.
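The Latin 1 mix-up described above can be reproduced in a few lines of Python 3. This is a small sketch with illustrative variable names; it also shows that the damage is reversible only if you know exactly which wrong decoding was applied.

```python
original = "Fööbär"
utf8_bytes = original.encode("utf-8")

# Correct round trip: decode with the encoding the bytes were written in.
assert utf8_bytes.decode("utf-8") == original

# The mistake: reading UTF-8 bytes as Latin 1 yields mojibake.
misread = utf8_bytes.decode("latin-1")
print(misread)

# Re-encoding the mojibake as UTF-8 bakes the error into the data.
double_encoded = misread.encode("utf-8")
assert double_encoded != utf8_bytes

# Undoing it requires reversing each wrong step in order.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
assert repaired == original
```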
