What does sys.maxunicode mean?
CPython stores unicode strings internally as either UTF-16 or UTF-32, depending on compile options. In UTF-16 builds of Python, string slicing, iteration, and len seem to work on code units, not code points, so characters outside the BMP (which take two code units each) behave strangely.
E.g., on CPython 2.6 with sys.maxunicode = 65535:
>>> char = u'\U0001D49E'
>>> len(char)
2
>>> char[0:1]
u'\ud835'
>>> char[1:2]
u'\udc9e'
According to the Python documentation, sys.maxunicode is "An integer giving the largest supported code point for a Unicode character."
Does this mean that unicode operations aren't guaranteed to work on code points beyond sys.maxunicode? If I want to work with characters outside the BMP, do I either have to use a UTF-32 build or write my own portable unicode operations?
I came across this problem in "How to iterate over Unicode characters in Python 3?"
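To make the narrow-build session concrete, here is a minimal sketch of the UTF-16 arithmetic behind it, plus one possible way to walk a string by code point instead of by code unit. The names `to_surrogate_pair` and `iter_code_points` are hypothetical helpers invented for this illustration, not part of any library. The sketch is written for Python 3, where a surrogate pair can be spelled out in a str literal to mimic how a narrow build stores the character.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits of the offset
    return high, low


def iter_code_points(text):
    """Yield integer code points, merging any surrogate pairs found in text."""
    i = 0
    while i < len(text):
        unit = ord(text[i])
        # A high surrogate followed by a low surrogate encodes one code point.
        if 0xD800 <= unit <= 0xDBFF and i + 1 < len(text):
            nxt = ord(text[i + 1])
            if 0xDC00 <= nxt <= 0xDFFF:
                yield 0x10000 + ((unit - 0xD800) << 10) + (nxt - 0xDC00)
                i += 2
                continue
        # Anything else (BMP character or lone surrogate) is passed through.
        yield unit
        i += 1


high, low = to_surrogate_pair(0x1D49E)
print(hex(high), hex(low))  # 0xd835 0xdc9e
```

For instance, `list(iter_code_points('\ud835\udc9e'))` yields the single code point 0x1D49E. Since Python 3.3 (PEP 393) there are no narrow builds and sys.maxunicode is always 0x10FFFF, so this kind of helper only matters on older interpreters.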
Comments (1)
On a narrow build (sys.maxunicode = 65535), characters beyond the BMP are stored internally as UTF-16 surrogate pairs. Yes, you have to deal with this yourself or use a wide build. Even with a wide build, you may also have to deal with single characters represented by a combination of code points. For example, an accented letter can be written either as a base letter followed by a combining accent character or as a single precomposed code point, and both print the same. You can use unicodedata.normalize to convert between the forms.
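A minimal sketch of the two forms, assuming "e" with an acute accent as the example character (the characters chosen here are illustrative and are not taken from the original answer):

```python
import unicodedata

decomposed = u'e\u0301'   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = u'\u00e9'   # U+00E9 LATIN SMALL LETTER E WITH ACUTE

# The two strings render the same but differ in length and compare unequal.
print(len(decomposed))            # 2
print(len(precomposed))           # 1
print(decomposed == precomposed)  # False

# NFC composes to the single code point; NFD decomposes to base + accent.
print(unicodedata.normalize('NFC', decomposed) == precomposed)  # True
print(unicodedata.normalize('NFD', precomposed) == decomposed)  # True
```

Normalizing both sides to the same form (NFC or NFD) before comparing, slicing, or counting characters is one way to keep such strings consistent.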