Is there any harm in encoding a string multiple times with the same encoding? (Python)
Is there any harm in encoding a string multiple times in Python with the same encoding (i.e., UTF-8)?
I have a function that uses another function to get a string from a document and then serializes that string. Currently, the only caller of the second function (the one that gets the string from the document) is the first function.
This might change in the future: someone might decide to use it in another serialization (or similar) function without first encoding its result to UTF-8. I'm wondering if it's safe to always return a UTF-8 encoded string from it (at the moment, the serialization function will also call .encode() on this string again). My testing indicates this isn't a problem, but I figured I'd ask.
Thank you!
Answers (4)
You can't encode multiple times; it doesn't work.
In Python 2, calling encode on an already encoded, non-ASCII string fails with "ascii codec can't decode". What the encode method on a byte string does is first decode the string to Unicode and then encode it again with the given encoding. It decodes with the system default encoding, which is ASCII by default.
That behavior is unexpected, and it is gone in Python 3, by the way, where bytes objects don't have an encode method and strings don't have a decode method.
So you simply can't encode it multiple times, and that's because encoding an already encoded string doesn't make any sense. Encoding converts Unicode to a binary representation, and you can't further encode a binary representation.
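As a rough illustration of the point above, the following Python 3 sketch shows that encoded data simply has no encode method, and imitates what Python 2 did behind the scenes. The helper name py2_style_encode is hypothetical and exists only for this demonstration; it is an approximation of the old behavior, not the actual Python 2 implementation.

```python
# Python 3: text (str) and encoded data (bytes) are distinct types.
text = "F\u00f6\u00f6b\u00e4r"        # "Fööbär"
encoded = text.encode("utf-8")

print(type(encoded).__name__)         # bytes
print(hasattr(encoded, "encode"))     # False: bytes cannot be encoded again
print(hasattr(text, "decode"))        # False: str cannot be decoded


def py2_style_encode(data: bytes, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper imitating Python 2's str.encode():
    implicitly decode as ASCII first, then encode with `encoding`."""
    return data.decode("ascii").encode(encoding)


try:
    py2_style_encode(encoded)
except UnicodeDecodeError as exc:
    # The message begins with "'ascii' codec can't decode byte ..."
    print(exc)
```

Pure ASCII input passes through py2_style_encode unchanged, which is probably why tests with ASCII-only strings appear to work.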
Unless the string is pure ASCII, then yes, it can cause harm (and if it's pure ASCII, you don't need to worry about UTF-8).
It's good practice to treat byte sequences and text as two different things. In Python 3, they are different things: bytes objects have a decode() method, and string (Unicode) objects have an encode() method.
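To make the pure ASCII caveat concrete, here is a small Python 3 sketch. The helper decodes_as_ascii is a hypothetical name introduced for this example; it checks whether an implicit ASCII decode (the step that bites in Python 2) would succeed.

```python
ascii_text = "hello"
non_ascii_text = "F\u00f6\u00f6b\u00e4r"   # "Fööbär"

# Pure ASCII text produces identical bytes under ASCII and UTF-8.
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(same_bytes)                              # True

# Non-ASCII characters take more than one byte in UTF-8.
utf8_bytes = non_ascii_text.encode("utf-8")
print(len(non_ascii_text), len(utf8_bytes))    # 6 9


def decodes_as_ascii(data: bytes) -> bool:
    """Return True if `data` can be decoded as ASCII without error."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


print(decodes_as_ascii(ascii_text.encode("utf-8")))   # True
print(decodes_as_ascii(utf8_bytes))                   # False
```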
In general, you should only call encode on unicode objects and only call decode on string objects. encode encodes a Unicode object into a given encoding (stored as a string). decode decodes a given encoding back into a Unicode object. The existence of string.encode and unicode.decode in 2.x should be treated as a historical artifact.
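One way to apply this advice to the situation in the question is to keep text as text inside the program and encode exactly once, at the point of serialization. The sketch below is written for Python 3; get_text, serialize, and the dict-based document are all hypothetical stand-ins, since the question does not show the real functions.

```python
def get_text(document: dict, key: str) -> str:
    """Hypothetical getter: always returns text (str), never bytes."""
    value = document[key]
    if isinstance(value, bytes):
        # Assumption: any bytes stored in the document are UTF-8.
        value = value.decode("utf-8")
    return value


def serialize(text: str) -> bytes:
    """Hypothetical serializer: the single place where encoding happens."""
    return text.encode("utf-8")


document = {
    "title": "F\u00f6\u00f6b\u00e4r",                  # already text
    "raw": "F\u00f6\u00f6b\u00e4r".encode("utf-8"),    # encoded bytes
}

# Both paths end up as the same bytes, and nothing is encoded twice.
print(serialize(get_text(document, "title")) == serialize(get_text(document, "raw")))  # True
```

With this split, a future caller of the getter receives text and can choose its own encoding, instead of having to know that the result was already encoded.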
Well, if you have a stream of bytes that are UTF-8-encoded text, interpret them as a string encoded in something else, and then re-encode that as UTF-8, then you have a problem.
If you read it as UTF-8 again (you cannot treat bytes as text without an encoding, after all), then you have Unicode, which, when written as UTF-8 again, will look the same as before.
Just be careful not to mess around with the encodings too much. A common error is to interpret UTF-8 encoded text as Latin 1, thereby turning Fööbär into FÃ¶Ã¶bÃ¤r, which then of course won't change back on its own.
Note the difference between text (the actual thing you care about) and encoded text, which is just a bunch of bytes plus the knowledge of how to turn them into text again. If you treat the latter as the former, problems arise. If you convert properly from one representation to the other, it's fine.
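The Latin 1 mix-up described above can be reproduced in a few lines of Python 3. This is only a sketch of the failure mode; the variable names are made up for the example, and the repair at the end works only because the wrong step is known to have been a Latin 1 decode.

```python
original = "F\u00f6\u00f6b\u00e4r"            # "Fööbär"
utf8_bytes = original.encode("utf-8")

# Correct round trip: decode with the encoding that was used to encode.
print(utf8_bytes.decode("utf-8") == original)   # True

# The common mistake: reading UTF-8 bytes as Latin 1.
mangled = utf8_bytes.decode("latin-1")
print(mangled)                                  # FÃ¶Ã¶bÃ¤r

# Re-encoding the mangled text as UTF-8 stores the damage permanently.
double_encoded = mangled.encode("utf-8")
print(double_encoded.decode("utf-8") == mangled)   # True

# Repair is possible only by reversing the exact wrong step.
repaired = double_encoded.decode("utf-8").encode("latin-1").decode("utf-8")
print(repaired == original)                     # True
```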