What does sys.maxunicode mean?

Posted on 2024-12-06 02:33:04

CPython stores unicode strings internally as either UTF-16 or UTF-32, depending on compile options. In UTF-16 builds of Python, string slicing, iteration, and len seem to work on code units, not code points, so characters outside the BMP (which take two code units) behave strangely.

E.g., on CPython 2.6 with sys.maxunicode = 65535:

>>> char = u'\U0001D49E'
>>> len(char)
2
>>> char[0:1]
u'\ud835'
>>> char[1:2]
u'\udc9e'

According to the Python documentation, sys.maxunicode is "An integer giving the largest supported code point for a Unicode character."

Does this mean that unicode operations aren't guaranteed to work on code points beyond sys.maxunicode? If I want to work with characters outside the BMP, do I either have to use a UTF-32 build or write my own portable unicode operations?

I came across this problem in "How to iterate over Unicode characters in Python 3?"
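The two code units in the session above follow directly from the UTF-16 surrogate arithmetic. The following is a minimal sketch of that arithmetic, assuming Python 3; the helper names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical and not part of any library.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000        # 20-bit value to split in two
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a high/low surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1D49E)
print(hex(high), hex(low))  # 0xd835 0xdc9e
```

On a narrow build these two values are what `char[0:1]` and `char[1:2]` expose. On CPython 3.3 and later, sys.maxunicode is always 0x10FFFF and `len('\U0001D49E')` is 1, so this split is only visible when encoding to UTF-16.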


流年里的时光 2024-12-13 02:33:04

Characters beyond sys.maxunicode = 65535 are stored internally as UTF-16 surrogate pairs. Yes, you have to deal with this yourself or use a wide build. Even with a wide build, you may also have to deal with single characters that are represented by a combination of code points. For example:

>>> print('a\u0301')
á
>>> print('\xe1')
á

The first uses a combining accent character and the second doesn't. Both print the same. You can use unicodedata.normalize to convert between the forms.
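As a small illustration of that last point, here is a sketch using `unicodedata.normalize` from the standard library, assuming Python 3. The variable names are only for illustration.

```python
import unicodedata

decomposed = 'a\u0301'   # 'a' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = '\xe1'     # U+00E1 LATIN SMALL LETTER A WITH ACUTE

# The strings render the same but are not equal as code point sequences.
print(decomposed == precomposed)           # False
print(len(decomposed), len(precomposed))   # 2 1

# NFC composes where possible; NFD decomposes.
nfc = unicodedata.normalize('NFC', decomposed)
nfd = unicodedata.normalize('NFD', precomposed)
print(nfc == precomposed)  # True
print(nfd == decomposed)   # True
```

Normalizing both sides to the same form (NFC is a common choice) before comparing or measuring length avoids surprises from visually identical strings.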


About the author: 泛滥成性
