What is the best way to determine how a unicode string was decoded in Python?
I was wondering how to determine the encoding of a unicode string.
I know I've read about this somewhere; I just don't remember whether it was possible or not, but I want to believe there is a way.
Let's say I have a unicode string that was decoded from latin-1, and I'd like to dynamically encode it with the same encoding that was used when decoding it...
Frankly, I'd like to turn it into a utf-8 unicode string without messing up the characters before working with it.
I.e.:
latin1_unicode = 'åäö'.decode('latin-1')
utf8_unicode = latin1_unicode.encode('latin-1').decode('utf-8')
Comments (1)
If, in "determine the encoding of a unicode", "unicode" is the python data type, then you cannot do it, as "encoding" refers to the original byte patterns that represented the string when it was input (say, read from a file, a database, you name it). By the time it becomes a python 'unicode' type (an internal representation), the string has either been decoded behind the scenes or has thrown a decoding exception because a byte sequence did not jibe with the system encoding.
Shadyabhi's answer refers to the (common) case in which you are reading bytes from a file (which you may very well be storing in a byte string - not a python unicode string) and need to guess in what encoding they were saved. Strictly speaking, you cannot have a "latin1 unicode python string": a unicode python string has no encoding. Encoding may be defined as the process that translates a character to a byte pattern, and decoding as the inverse process; a decoded string therefore has no encoding, though it can be encoded in several ways for storage/external representation purposes.
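To illustrate the point that decoded text carries no encoding of its own, here is a minimal sketch. It assumes Python 3 (the question uses Python 2 syntax), where `str` plays the role of Python 2's `unicode` type and `bytes` the role of the byte string; the variable names are only illustrative.

```python
# Python 3 sketch: `str` is decoded text, `bytes` is an encoded representation.
text = "\u00e5\u00e4\u00f6"  # the characters å, ä, ö written as escapes

# The same text can be encoded in several ways, giving different byte patterns.
latin1_bytes = text.encode("latin-1")  # one byte per character
utf8_bytes = text.encode("utf-8")      # two bytes per character for these three

print(latin1_bytes)  # b'\xe5\xe4\xf6'
print(utf8_bytes)    # b'\xc3\xa5\xc3\xa4\xc3\xb6'

# Decoding each byte string with its matching codec yields equal text,
# and nothing in the resulting str records which codec was used.
from_latin1 = latin1_bytes.decode("latin-1")
from_utf8 = utf8_bytes.decode("utf-8")
print(from_latin1 == from_utf8 == text)  # True
```

Once the bytes are decoded, the two results are indistinguishable, which is why the encoding cannot be recovered from the text itself.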
For instance, on my machine:
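A sketch of the kind of check meant here, not the original session: it assumes Python 3, and the assumption that the bytes for 'åäö' arrive as UTF-8 (as a UTF-8 terminal or source file would supply them) is mine. The reported default codec varies between Python versions and setups, so no particular value is claimed.

```python
import sys

# Default codec the interpreter uses for implicit conversions;
# the value depends on the Python version and setup.
default_codec = sys.getdefaultencoding()
print(default_codec)

# Assumed input: the bytes a UTF-8 environment would hand over for å, ä, ö.
raw = "\u00e5\u00e4\u00f6".encode("utf-8")

# latin-1 accepts every byte value, so decoding with the wrong codec
# does not fail; it silently produces six unrelated characters.
wrong = raw.decode("latin-1")
print(len(wrong), ascii(wrong))

# A stricter codec rejects the bytes with an exception instead.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError as exc:
    ascii_failed = True
    print("decode failed:", exc.reason)
```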
Which means that, in your example, latin1_unicode will contain garbage if the default encoding happens to be UTF-8, or UTF-16, or anything different from latin1.
So what you (may) want to do is:
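One possible shape for this, as a hedged sketch rather than the exact code from the answer: decode the incoming bytes with the encoding you know they were saved in, then encode the resulting text as UTF-8. It assumes Python 3, and `to_utf8` is a hypothetical helper name introduced only for illustration.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Hypothetical helper: re-encode bytes from a known encoding as UTF-8."""
    text = raw.decode(source_encoding)  # bytes -> text, using the known codec
    return text.encode("utf-8")         # text -> UTF-8 bytes


# The bytes for å, ä, ö as a latin-1 source would store them.
latin1_bytes = b"\xe5\xe4\xf6"
utf8_bytes = to_utf8(latin1_bytes, "latin-1")
print(utf8_bytes)  # b'\xc3\xa5\xc3\xa4\xc3\xb6'
```

The source encoding has to come from outside the data (file metadata, a protocol header, or a guess); it cannot be read back from the decoded text.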