Supplementary code points in Python unicode strings

Posted on 2025-01-05 03:44:39

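When CPython is compiled without --enable-unicode=ucs4, unichr(0x10000) fails with a ValueError.

Is there a language builtin or core library function that converts an arbitrary unicode scalar value or code-point to a unicode string that works regardless of what kind of python interpreter the program is running on?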

遥远的绿洲 2025-01-12 03:44:39

Yes, here you go:

>>> unichr(0xd800)+unichr(0xdc00)
u'\U00010000'

The crucial point to understand is that unichr() converts an integer to a single code unit in the Python interpreter's string encoding. The Python Standard Library documentation for 2.7.3, 2. Built-in Functions, on unichr() reads:

Return the Unicode string of one character whose Unicode code is the integer i.... The valid range for the argument depends how Python was configured – it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF]. ValueError is raised otherwise.

I added emphasis to "one character", by which they mean "one code unit" in Unicode terms.

I'm assuming that you are using Python 2.x. The Python 3.x interpreter has no built-in unichr() function. Instead, the Python Standard Library documentation for 3.3.0, 2. Built-in Functions, on chr() reads:

Return the string representing a character whose Unicode codepoint is the integer i.... The valid range for the argument is from 0 through 1,114,111 (0x10FFFF in base 16).

Note that the return value is now a string of unspecified length, not a string with a single code unit. So in Python 3.x, chr(0x10000) would behave as you expected. It "converts an arbitrary unicode scalar value or code-point to a unicode string that works regardless of what kind of python interpreter the program is running on".

But back to Python 2.x. If you use unichr() to create Python 2.x unicode objects, and you are using Unicode scalar values above 0xFFFF, then you are committing your code to being aware of the Python interpreter's implementation of unicode objects.
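As a rough sketch of how code can tell the two kinds of build apart: sys.maxunicode reports the largest code point a single string element can hold, which is 0xFFFF on a narrow (UCS-2) build and 0x10FFFF on a wide (UCS-4) build. The helper name build_kind is made up for illustration and is not part of any library.

```python
import sys

def build_kind():
    """Return 'narrow' for a UCS-2 build, 'wide' otherwise (hypothetical helper)."""
    # On a narrow build one string element is a 16-bit code unit,
    # so sys.maxunicode stops at 0xFFFF.
    return 'narrow' if sys.maxunicode == 0xFFFF else 'wide'

print(build_kind())
```

On Python 3.3 and later this reports 'wide' on every build, so the distinction only matters for Python 2.x and early 3.x interpreters.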

You can isolate this awareness with a function which tries unichr() on a scalar value, catches ValueError, and tries again with the corresponding UTF-16 surrogate pair:

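def unichr_supplemental(scalar):
    try:
        return unichr(scalar)
    except ValueError:
        # Narrow (UCS-2) build: unichr() rejects anything above 0xFFFF.
        if not 0x10000 <= scalar <= 0x10FFFF:
            raise  # genuinely out of range, not just a narrow-build limit
        # Encode the supplementary code point as a UTF-16 surrogate pair.
        offset = scalar - 0x10000
        return unichr(0xd800 + offset // 0x400) + unichr(0xdc00 + offset % 0x400)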

>>> unichr_supplemental(0x41),len(unichr_supplemental(0x41))
(u'A', 1)
>>> unichr_supplemental(0x10000), len(unichr_supplemental(0x10000))
(u'\U00010000', 2)
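To make the surrogate arithmetic in the fallback easier to check, here is a small sketch that works purely on integers, so it behaves the same on any interpreter. The function names to_surrogate_pair and from_surrogate_pair are illustrative, not standard library functions.

```python
def to_surrogate_pair(scalar):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError('not a supplementary code point: %#x' % scalar)
    offset = scalar - 0x10000          # 20-bit value
    high = 0xD800 + offset // 0x400    # top 10 bits
    low = 0xDC00 + offset % 0x400      # bottom 10 bits
    return high, low

def from_surrogate_pair(high, low):
    """Inverse of to_surrogate_pair()."""
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)

print([hex(u) for u in to_surrogate_pair(0x1F600)])  # ['0xd83d', '0xde00']
```

For 0x10000 the offset is zero, which gives exactly the 0xd800, 0xdc00 pair used at the top of this answer.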

But you might find it easier to just convert your scalars to 4-byte UTF-32 values in a UTF-32 byte string, and decode this byte string into a unicode string:

>>> '\x00\x00\x00\x41'.decode('utf-32be'), \
... len('\x00\x00\x00\x41'.decode('utf-32be'))
(u'A', 1)
>>> '\x00\x01\x00\x00'.decode('utf-32be'), \
... len('\x00\x01\x00\x00'.decode('utf-32be'))
(u'\U00010000', 2)
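The byte strings above are written out by hand. A minimal sketch of doing the same thing for an arbitrary scalar, assuming the struct module is used to build the 4 bytes (the helper name unichr_utf32 is hypothetical):

```python
import struct

def unichr_utf32(scalar):
    """Decode one code point via UTF-32 (hypothetical helper name)."""
    # '>I' packs the integer as 4 big-endian bytes,
    # which is one UTF-32BE code unit.
    return struct.pack('>I', scalar).decode('utf-32be')

print(repr(unichr_utf32(0x41)))
print(repr(unichr_utf32(0x10000)))
```

The length of the result for a supplementary code point still depends on the interpreter, just as in the session above. Values that are not valid code points should be rejected by struct.pack() or by the decoder rather than silently producing a string, though I have not checked every interpreter version.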

The code above was tested on Python 2.6.7 with UTF-16 encoding for Unicode strings. I didn't test it on a Python 2.x interpreter with UTF-32 encoding for Unicode strings. However, it should work unchanged on any Python 2.x interpreter with any Unicode string implementation.
