I am writing a script that will try encoding bytes into many different encodings in Python 2.6. Is there some way to get a list of available encodings that I can iterate over?
The reason I'm trying to do this is because a user has some text that is not encoded correctly. There are funny characters. I know the unicode character that's messing it up. I want to be able to give them an answer like "Your text editor is interpreting that string as X encoding, not Y encoding". I thought I would try to encode that character using one encoding, then decode it again using another encoding, and see if we get the same character sequence.
i.e. something like this:
for encoding1, encoding2 in itertools.permutations(encodinglist(), 2):
    try:
        unicode_string = my_unicode_character.encode(encoding1).decode(encoding2)
    except (UnicodeError, LookupError):
        pass
Other answers here seem to indicate that constructing this list programmatically is difficult and fraught with traps. However, doing so is probably unnecessary since the documentation contains a complete list of the standard encodings Python supports, and has done since Python 2.3.
You can find these lists in the codecs documentation for each stable version of the language released so far; they are summarised below, per documented version of Python. Note that if you want backwards compatibility, rather than just support for a particular version of Python, you can copy the list from the latest Python version and check whether each encoding exists in the Python running your program before trying to use it.
Python 2.3 (59 encodings)
Python 2.4 (85 encodings)
Python 2.5 (86 encodings)
Python 2.6 (90 encodings)
Python 2.7 (93 encodings)
Python 3.0 (89 encodings)
Python 3.1 (90 encodings)
Python 3.2 (92 encodings)
Python 3.3 (93 encodings)
Python 3.4 (96 encodings)
Python 3.5 (98 encodings)
Python 3.6 (98 encodings)
Same as previous version.
Python 3.7 (98 encodings)
Same as previous version.
Python 3.8 (97 encodings)
Python 3.9 (97 encodings)
Same as previous version.
Python 3.10 (97 encodings)
Same as previous version.
Python 3.11 (97 encodings)
Same as previous version.
Python 3.12 (97 encodings)
Same as previous version.
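The existence check suggested above (copy the newest list, then probe the running interpreter) can be sketched as follows; the candidate list here is just a short illustrative sample, not the full table from the docs:

```python
import codecs

# A few names taken from the tables above plus one deliberately bogus
# entry, standing in for the full list copied from the latest docs.
candidates = ["utf_8", "latin_1", "cp1252", "koi8_u", "not_a_real_codec"]

available = []
for name in candidates:
    try:
        codecs.lookup(name)  # raises LookupError if this Python lacks the codec
    except LookupError:
        continue
    available.append(name)

print(available)  # the bogus entry is filtered out
```

This way a single list, maintained against the newest docs, degrades gracefully on older interpreters.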
In case they're relevant to anyone's use case, note that the docs also list some Python-specific encodings, many of which seem to be primarily for use by Python's internals or are otherwise weird in some way, like the 'undefined' encoding, which always throws an exception if you try to use it. You probably want to ignore these completely if, like the question-asker here, you're trying to figure out what encoding was used for some text you've come across in the real world; as of Python 3.7 they are listed under "Python Specific Encodings" in the codecs documentation. Some older Python versions had a string_escape special encoding that I've not included in the above list because it's been removed from the language. Finally, in case you'd like to update my tables above for a newer version of Python, here's the (crude, not very robust) script I used to generate them:
Unfortunately encodings.aliases.aliases.keys() is NOT an appropriate answer. aliases (as one would/should expect) contains several cases where different keys are mapped to the same value, e.g. 1252 and windows_1252 are both mapped to cp1252. You could save time if instead of aliases.keys() you use set(aliases.values()).

BUT THERE'S A WORSE PROBLEM: aliases doesn't contain codecs that don't have aliases (like cp856, cp874, cp875, cp737, and koi8_u).

It's also worth noting that however you obtain a full list of codecs, it may be a good idea to ignore the codecs that aren't about encoding/decoding character sets but do some other transformation, e.g. zlib, quopri, and base64.

Which brings us to the question of WHY you want to "try encoding bytes into many different encodings". If we know that, we may be able to steer you in the right direction.

For a start, that's ambiguous. One DEcodes bytes into unicode, and one ENcodes unicode into bytes. Which do you want to do?

What are you really trying to achieve: Are you trying to determine which codec to use to decode some incoming bytes, and plan to attempt this with all possible codecs? [note: latin1 will decode anything] Are you trying to determine the language of some unicode text by trying to encode it with all possible codecs? [note: utf8 will encode anything]
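The two pitfalls described above are easy to demonstrate (a small sketch; the exact alias counts vary between Python versions):

```python
import codecs
import encodings.aliases

aliases = encodings.aliases.aliases

# Pitfall 1: many keys map to the same codec, so keys() overcounts.
print(aliases["1252"], aliases["windows_1252"])  # both 'cp1252'
distinct = set(aliases.values())
print(len(aliases), "alias keys collapse to", len(distinct), "codecs")

# Pitfall 2: a codec with no alias never appears in the table at all,
# even though codecs.lookup() finds it just fine.
print(codecs.lookup("cp856").name)   # a real codec...
print("cp856" in distinct)           # ...absent from the alias table
```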
Maybe you should try using the Universal Encoding Detector (chardet) library instead of implementing it yourself.
You could use a technique to list all modules in the encodings package.

I doubt there is such a method/functionality in the codecs module, but if you look at encodings/__init__.py, the search function searches through the encodings module folder, so you may do the same; but as anybody can register a codec, that won't be an exhaustive list.
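One way to mimic that search function is to walk the encodings package with pkgutil (a sketch; note that not every module found is a usable codec — aliases, for example, is just the alias table):

```python
import pkgutil
import encodings

# Enumerate every module shipped inside the stdlib encodings package,
# the same folder the codec search function scans.
names = sorted(module.name for module in pkgutil.iter_modules(encodings.__path__))

print(len(names), "modules found, e.g.", names[:5])
```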
From the Python 3.7.6 source, under /Tools/unicode/listcodecs.py:
Then:
The Python source code has a script at Tools/unicode/listcodecs.py which lists all codecs. Among the listed codecs, however, there are some that are not Unicode-to-byte converters, like base64_codec, quopri_codec, and bz2_codec, as @John Machin pointed out.
Here's a programmatic way to list all the encodings defined in the stdlib encodings package. Note that this won't list user-defined encodings. This combines some of the tricks in the other answers, but actually produces a working list using the codec's canonical name.
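The answer's original code isn't preserved in this copy, but one plausible version combines the pkgutil walk with codecs.lookup, which both resolves each module to its canonical name and skips non-codec modules such as aliases:

```python
import codecs
import encodings
import pkgutil

def stdlib_encoding_names():
    """Canonical names of the codecs shipped in the stdlib encodings package."""
    found = set()
    for module in pkgutil.iter_modules(encodings.__path__):
        try:
            info = codecs.lookup(module.name)
        except LookupError:
            continue  # e.g. 'aliases', or platform-only codecs like 'mbcs'
        found.add(info.name)  # canonical name, e.g. 'utf_8' -> 'utf-8'
    return sorted(found)

names = stdlib_encoding_names()
print(len(names), "codecs, e.g.", names[:5])
```

The try/except also makes the list reflect what the running interpreter can actually use, which is what matters for the asker's brute-force approach.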
Perhaps you can do something like this:
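The code for this answer wasn't preserved in this copy; a guess at the kind of snippet meant, collapsing the alias table down to its distinct target codec names (with the caveat, raised in an earlier answer, that codecs without aliases are missed):

```python
import encodings.aliases

# Distinct codec names that the alias table points at; unaliased
# codecs (cp856 and friends) won't show up here.
names = sorted(set(encodings.aliases.aliases.values()))
print(len(names), "names, e.g.", names[:5])
```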