What encoding do normal Python strings use?
I know that Django uses unicode strings all over the framework instead of normal Python strings. What encoding do normal Python strings use? And why don't they use unicode?

In Python 2: Normal strings (Python 2.x str) don't have an encoding: they are raw data.

In Python 3: These are called "bytes", which is an accurate description, as they are simply sequences of bytes, which can be text encoded in any encoding (several are common!) or non-textual data altogether.

For representing text, you want unicode strings, not byte strings. By "unicode strings", I mean unicode instances in Python 2 and str instances in Python 3. Unicode strings are sequences of unicode codepoints represented abstractly without an encoding; this is well-suited for representing text.

Bytestrings are important because to represent data for transmission over a network or writing to a file or whatever, you cannot have an abstract representation of unicode, you need a concrete representation of bytes. Though they are often used to store and represent text, this is at least a little naughty.

This whole situation is complicated by the fact that while you should turn unicode into bytes by calling encode and turn bytes into unicode using decode, Python 2 will try to do this automagically for you using a global encoding you can set that is by default ASCII, which is the safest choice. Never depend on this for your code and never ever change this to a more flexible encoding: explicitly decode when you get a bytestring and encode if you need to send a string somewhere external.
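
The "decode when you get a bytestring, encode when you send one out" rule can be sketched roughly as follows. This is only an illustration for Python 3 (where the implicit ASCII conversion is gone); the sample bytes and the choice of UTF-8 are assumptions, not something the answer prescribes.

```python
# Bytes as they might arrive from a file or socket.
# Assumption for this sketch: the sender used UTF-8.
raw = b"caf\xc3\xa9"

# Decode explicitly at the boundary: bytes -> text.
text = raw.decode("utf-8")
print(len(raw), len(text))  # byte count vs. code point count

# Work with text internally, then encode explicitly on the way out.
out = text.upper().encode("utf-8")
print(out)

# Python 3 refuses to combine the two types implicitly.
try:
    raw + text
    mixed_ok = True
except TypeError:
    mixed_ok = False
print("implicit mixing allowed:", mixed_ok)
```

The point of doing both conversions by hand is that the encoding is named at the edge of the program, so nothing depends on a process-wide default.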

Hey! I'd like to add some stuff to the other answers; unfortunately I don't have enough rep yet to do that properly :-(

FWIW, Mike Graham's post is pretty good and that's probably what you should be reading first.

Here are a few comments:

1. In Python 2.6 and later, from __future__ import unicode_literals makes unprefixed string literals Unicode.
2. The source file encoding is declared with a comment such as # -*- coding: utf-8 -*-. For more information see PEP 0263. Changing the source encoding affects how Unicode literals (regardless of their prefix or lack of prefix, as affected by point 1) are interpreted. In Py3k, the default file encoding is UTF-8.
3. Python uses an internal encoding for Unicode strings (str in py3k, unicode in 2.x) because at some point in time stuff's going to have to be written to memory. Ideally, this would never be evident to the end-user. Unfortunately nothing's perfect and you can occasionally run into problems with this: specifically if you use funky squiggles outside of the Unicode Basic Multilingual Plane. Since Python 2.2, we've had what's called wide builds and narrow builds; these names refer to the type used internally to store Unicode code points. Wide builds use UCS-4, which uses 4 bytes to store a Unicode code point. (This means UCS-4's code unit size is 4 bytes, or 32 bits.) Narrow builds use UCS-2. UCS-2 only has 16 bits, and therefore cannot encode all Unicode code points accurately (it's like UTF-16, except without the surrogate pairs). To check, test the value of sys.maxunicode. If it's 1114111, you've got a wide build (which can correctly represent all of Unicode). If it's less, well, don't fret too much. The BMP (code points 0x0000 to 0xFFFF) covers most people's needs. For more information, see PEP 0261. (The narrow/wide distinction went away in Python 3.3 with PEP 393; there sys.maxunicode is always 1114111.)
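
The sys.maxunicode check from the last point might look like this. It is a small sketch that assumes Python 3.3 or later, where the result is always the "wide" answer; the sample character U+1F600 is just an arbitrary code point outside the BMP.

```python
import sys

# 1114111 == 0x10FFFF, the highest Unicode code point.
is_wide = sys.maxunicode == 0x10FFFF
print("can represent all of Unicode directly:", is_wide)

# U+1F600 lies outside the BMP (above 0xFFFF).
ch = "\U0001F600"
print(len(ch))  # 1 here; a narrow build would report 2

# In UTF-16 the same character needs a surrogate pair,
# i.e. two 16-bit code units.
utf16_units = len(ch.encode("utf-16-le")) // 2
print(utf16_units)
```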

In Python 3.x, str is Unicode. In Python 3.0 to 3.2 the internal form may be either UTF-16 or UTF-32 depending on whether your Python interpreter was built with "narrow" or "wide" Unicode characters. The Windows version of CPython uses UTF-16. On Unix-like systems, UTF-32 tends to be preferred.

In Python 2.x, str is a byte string type like C char. The encoding isn't defined by the language, but is whatever your locale's default encoding is. Or whatever the MIME charset of the document you got off the Internet is. Or, if you get a string from a function like struct.pack, it's binary data, and doesn't meaningfully have a character encoding at all.

unicode strings in 2.x are equivalent to str in 3.x.

As for why they don't use unicode: because Python (slightly) predates Unicode. And because Guido wanted to save all the major backwards-incompatible changes for 3.0. Strings in 3.x do use Unicode by default.
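
To make the "the encoding is whatever the data happens to be in" point concrete, here is a rough Python 3 sketch. The format string "<I" and the two codecs (latin-1 and cp1251) are picked purely for illustration.

```python
import struct

# struct.pack returns raw bytes: a little-endian 32-bit integer here.
# There is no character encoding to speak of.
packed = struct.pack("<I", 1)
print(packed)

# One and the same byte means different text under different encodings.
data = b"\xe9"
as_latin1 = data.decode("latin-1")  # U+00E9, LATIN SMALL LETTER E WITH ACUTE
as_cp1251 = data.decode("cp1251")   # U+0439, CYRILLIC SMALL LETTER SHORT I
print(ascii(as_latin1), ascii(as_cp1251))

# And it is not valid UTF-8 at all.
try:
    data.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print("valid UTF-8:", valid_utf8)
```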

From Python 3.0 on, all strings are unicode by default; there is also the bytes datatype (Python documentation).

So the Python developers think that using unicode is a good idea. That it is not used universally in Python 2 is mostly due to backwards compatibility. It also has performance implications.
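
A quick way to see the two Python 3 datatypes side by side (a sketch only; the sample word and UTF-8 are arbitrary choices):

```python
s = "na\u00efve"        # str: a sequence of Unicode code points
b = s.encode("utf-8")   # bytes: a sequence of integers in range(256)

print(type(s).__name__, len(s))
print(type(b).__name__, len(b))

# Indexing shows the difference: a one-character str vs. an int.
print(ascii(s[2]), b[2])
```

The lengths differ because the one non-ASCII character takes two bytes in UTF-8.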

Python 2.x strings are 8-bit, nothing more. The encoding may vary (though ASCII is assumed). I guess the reasons are historical. Few languages, especially languages that date back to the last century, used unicode right away.

In Python 3, all strings are unicode.

Before Python 3.0, string encoding was ascii by default, but could be changed. Unicode string literals were u"...". This was silly.