How do I get a reliable Unicode character count in Python?

Posted 2024-11-27 15:30:19

Google App Engine uses Python 2.5.2, apparently with UCS4 enabled. But the GAE datastore uses UTF-8 internally. So if you store u'\ud834\udd0c' (length 2) to the datastore, when you retrieve it, you get u'\U0001d10c' (length 1). I'm trying to count the number of unicode characters in the string in a way that gives the same result before and after storing it. So I'm trying to normalize the string (from u'\ud834\udd0c' to u'\U0001d10c') as soon as I receive it, before calculating its length and putting it in the datastore. I know I can just encode it to UTF-8 and then decode again, but is there a more straightforward/efficient way?
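
For concreteness, the round-trip in question looks like this (a minimal sketch, assuming a wide/UCS-4 Python 2 build like the one described, where the UTF-8 codec joins a surrogate pair into a single code point):

s = u'\ud834\udd0c'                    # a surrogate pair: len(s) == 2 on a wide build
s = s.encode('utf-8').decode('utf-8')  # round-trip joins the pair: now u'\U0001d10c'
assert len(s) == 1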

Comments (2)

来世叙缘 2024-12-04 15:30:19

I know I can just encode it to UTF-8 and then decode again

Yes, that's the usual idiom for fixing up “UTF-16 surrogates in a UCS-4 string” input. But as Mechanical snail said, this input is malformed, and you should preferably fix whatever produced it instead.

is there a more straightforward/efficient way?

Well... you could do it manually with a regex, like:

import re

s = u'\ud834\udd0c'  # example input; unichr above 0xFFFF needs a wide (UCS-4) build

s = re.sub(
    u'([\uD800-\uDBFF])([\uDC00-\uDFFF])',
    lambda m: unichr(((ord(m.group(1)) - 0xD800) << 10)
                     + ord(m.group(2)) - 0xDC00 + 0x10000),
    s
)

Certainly not more straightforward... I also have my doubts as to whether it's actually more efficient!
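
As a rough sanity check, the substitution agrees with the encode/decode round-trip (a sketch, assuming a wide Python 2 build; join_surrogates is just a hypothetical wrapper around the re.sub call above):

import re

def join_surrogates(s):
    # hypothetical wrapper around the substitution shown above
    return re.sub(
        u'([\uD800-\uDBFF])([\uDC00-\uDFFF])',
        lambda m: unichr(((ord(m.group(1)) - 0xD800) << 10)
                         + ord(m.group(2)) - 0xDC00 + 0x10000),
        s)

raw = u'\ud834\udd0c'  # the surrogate pair from the question
assert join_surrogates(raw) == raw.encode('utf-8').decode('utf-8') == u'\U0001d10c'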

九局 2024-12-04 15:30:19

Unfortunately, the behavior of the CPython interpreter in versions earlier than 3.3 depends on whether it is built with "narrow" or "wide" Unicode support. So the same code, such as a call to len, can have a different result in different builds of the standard interpreter. See this question for examples.

The distinction between "narrow" and "wide" is that "narrow" interpreters internally store 16-bit code units (UCS-2), whereas "wide" interpreters internally store 32-bit code units (UCS-4). Code points at U+10000 and above (outside the Basic Multilingual Plane) have a len of two on "narrow" interpreters, because two UCS-2 code units are needed to represent them (using surrogates), and that's what len measures. On "wide" builds, only a single UCS-4 code unit is required for a non-BMP code point, so for those builds len is one for such code points.
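
A quick way to see which behavior you have (a sketch; the printed values assume CPython 2.x, where sys.maxunicode reflects the build type):

import sys

s = u'\U0001d10c'      # a single non-BMP code point
print sys.maxunicode   # 65535 on "narrow" builds, 1114111 on "wide" builds
print len(s)           # 2 on "narrow" (a surrogate pair), 1 on "wide"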

I have confirmed that the code below handles all unicode strings, whether or not they contain surrogate pairs, and works in both narrow and wide builds of CPython 2.7. (Arguably, specifying a string like u'\ud83d\udc4d' in a wide interpreter reflects an affirmative desire to represent a complete surrogate code point, as distinct from a partial-character code unit, and is therefore not automatically an error to be corrected, but I'm ignoring that here. It's an edge case and normally not a desired use case.)

The @invoke trick used below is a way to avoid repeat computation without adding anything to the module's __dict__.

invoke = lambda f: f()  # trick taken from AJAX frameworks

@invoke
def codepoint_count():
  testlength = len(u'\U00010000')  # pre-compute once: 1 on "wide" builds, 2 on "narrow"
  assert (testlength == 1) or (testlength == 2)
  if testlength == 1:
    def closure(data):  # count function for "wide" interpreter
      u'returns the number of Unicode code points in a unicode string'
      # the UTF-16 round-trip merges any surrogate pairs into single code points
      return len(data.encode('UTF-16BE').decode('UTF-16BE'))
  else:
    def is_surrogate(c):
      ordc = ord(c)
      # lead (high) surrogate: 55296 == 0xD800, 56320 == 0xDC00
      return (ordc >= 55296) and (ordc < 56320)
    def closure(data):  # count function for "narrow" interpreter
      u'returns the number of Unicode code points in a unicode string'
      # each surrogate pair is two code units but one code point,
      # so subtract one for every lead surrogate found
      return len(data) - len(filter(is_surrogate, data))
  return closure

assert codepoint_count(u'hello \U0001f44d') == 7
assert codepoint_count(u'hello \ud83d\udc4d') == 7