In Python, what is the best way to determine how a unicode string was decoded?

Posted 2024-12-29 05:31:01

I was wondering how to determine the encoding of a unicode string.

I know I've read about this somewhere. I don't remember whether it was possible, but I want to believe there is a way.

Let's say I have a unicode string that was decoded from latin-1; I'd like to dynamically encode it with the same encoding that was used when decoding it...

Frankly, I'd like to turn it into a utf-8 unicode string without messing up the characters before working with it.

I.e.:

latin1_unicode = 'åäö'.decode('latin-1')
utf8_unicode = latin1_unicode.encode('latin-1').decode('utf-8')

向地狱狂奔 2025-01-05 05:31:01

If, in "determine the encoding of a unicode", "unicode" is the Python data type, then you cannot do it, because "encoding" refers to the original byte pattern that represented the string when it was input (say, read from a file, a database, you name it). By the time it becomes a Python 'unicode' object (an internal representation), the string has either been decoded behind the scenes or has thrown a decoding exception because a byte sequence did not fit the system encoding.

Shadyabhi's answer refers to the (common) case in which you are reading bytes from a file (which you may very well be stuffing into a string - not a Python unicode string) and need to guess in what encoding they were saved. Strictly speaking, you cannot have a "latin1 unicode python string": a Python unicode string has no encoding. Encoding may be defined as the process that translates a character to a byte pattern, and decoding as the inverse process; a decoded string therefore has no encoding, though it can be encoded in several ways for storage or external representation.
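To make the point concrete, here is a minimal sketch. It assumes Python 3, where `str` plays the role of the Python 2 `unicode` type and `bytes` plays the role of the old byte string; the variable names are only illustrative.

```python
# Python 3: str is the decoded (unicode) type, bytes is the encoded type.
text = "\u00e5\u00e4\u00f6"  # the three characters å, ä, ö

# The same text can be encoded into different byte patterns.
as_latin1 = text.encode("latin-1")
as_utf8 = text.encode("utf-8")

print(as_latin1)  # b'\xe5\xe4\xf6'
print(as_utf8)    # b'\xc3\xa5\xc3\xa4\xc3\xb6'

# Decoding each byte pattern with its own codec gives back the same str,
# and nothing in that str records which codec produced it.
assert as_latin1.decode("latin-1") == as_utf8.decode("utf-8") == text
```

The encoding is a property of the bytes, so it has to be known (or guessed) before decoding; it cannot be read back from the decoded string.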

For instance on my machine:

In [35]: sys.stdin.encoding
Out[35]: 'UTF-8'

In [36]: a='è'.decode('UTF-8')

In [37]: b='è'.decode('latin-1')

In [38]: a
Out[38]: u'\xe8'

In [39]: b
Out[39]: u'\xc3\xa8'

In [41]: sys.stdout.encoding
Out[41]: 'UTF-8'

In [42]: print b #it's garbage
Ã¨

In [43]: print a #it's OK
è

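Which means that, in your example, latin1_unicode will contain garbage if the bytes of 'åäö' were actually entered as UTF-8, or UTF-16, or anything other than latin1 (the encoding of that literal comes from your terminal or source file, not from the name you pass to decode).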
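The session can be reproduced without depending on the terminal. This is a sketch assuming Python 3 and spelling out the bytes explicitly; `raw` and `repaired` are names introduced here for illustration.

```python
# The two bytes a UTF-8 terminal sends for the character è (U+00E8).
raw = b"\xc3\xa8"

a = raw.decode("utf-8")    # correct codec: one character
b = raw.decode("latin-1")  # wrong codec: two characters, and no error is raised

print(ascii(a))  # '\xe8'
print(ascii(b))  # '\xc3\xa8'

# latin-1 maps every byte to exactly one code point, so the wrong decode is
# reversible: re-encode to recover the bytes, then decode with the right codec.
repaired = b.encode("latin-1").decode("utf-8")
assert repaired == a
```

This round trip is the same `encode('latin-1').decode('utf-8')` idea as in the question. It only helps when you already know that the bytes were UTF-8 and were decoded as latin-1 by mistake.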

So what you (may) want to do is:

1. Ascertain the encoding of your data source - perhaps using one of Shadyabhi's methods
2. Decode the data according to (1), save it in python unicode strings
3. Encode it using the original encoding (if that's what serves your needs) or some other encoding of your choosing.
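The steps above can be sketched as follows, assuming Python 3. Shadyabhi's methods are not shown here, so step 1 uses a simple stand-in: try a short list of candidate encodings in order. The helper `decode_with_candidates` is hypothetical and written only for this example.

```python
def decode_with_candidates(raw, candidates=("utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes raw.

    Hypothetical helper for illustration. Candidate order matters:
    latin-1 accepts any byte sequence, so it only makes sense last.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


# Stand-in for bytes read from a file: the characters å, ä, ö saved as latin-1.
raw = "\u00e5\u00e4\u00f6".encode("latin-1")

# Steps 1 and 2: guess the source encoding and decode to str.
text, source_encoding = decode_with_candidates(raw)

# Step 3: encode with the original encoding, or with one of your choosing.
same_as_source = text.encode(source_encoding)
as_utf8 = text.encode("utf-8")
```

Trial decoding is only a guess: a successful decode does not prove the encoding was right, which is why knowing the source encoding up front is always preferable.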
