How can I get a reliable Unicode character count in Python?
Google App Engine uses Python 2.5.2, apparently with UCS4 enabled. But the GAE datastore uses UTF-8 internally. So if you store u'\ud834\udd0c' (length 2) to the datastore, when you retrieve it, you get u'\U0001d10c' (length 1). I'm trying to count the number of Unicode characters in the string in a way that gives the same result before and after storing it. So I'm trying to normalize the string (from u'\ud834\udd0c' to u'\U0001d10c') as soon as I receive it, before calculating its length and putting it in the datastore. I know I can just encode it to UTF-8 and then decode it again, but is there a more straightforward or efficient way?
Comments (2)
Yes, that's the usual idiom to fix up the problem when you have “UTF-16 surrogates in UCS-4 string” input. But as Mechanical snail said, this input is malformed and you should be fixing whatever produced it in preference.
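For reference, the encode-then-decode round trip can be sketched as below. This is a minimal sketch written for Python 3, not the asker's Python 2.5 code: it assumes UTF-16 rather than UTF-8 as the intermediate encoding, because Python 3 refuses to encode surrogates unless the 'surrogatepass' error handler is given, and a UTF-8 round trip with that handler would hand the two surrogates back unmerged. The helper name join_surrogates is hypothetical.

```python
def join_surrogates(text):
    """Return text with each UTF-16 surrogate pair merged into one code point."""
    # 'surrogatepass' lets the codec encode surrogate code points instead of
    # raising UnicodeEncodeError. Decoding the UTF-16 bytes recombines
    # well-formed pairs; 'surrogatepass' on decode keeps lone surrogates as-is.
    data = text.encode('utf-16-le', 'surrogatepass')
    return data.decode('utf-16-le', 'surrogatepass')


raw = '\ud834\udd0c'          # two surrogate code points
fixed = join_surrogates(raw)  # one code point, U+1D10C
print(len(raw), len(fixed))   # prints: 2 1
```

Strings without surrogates pass through unchanged, so the helper can be applied to all input before measuring its length.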
Well... you could do it manually with a regex, like:
Certainly not more straightforward... I also have my doubts as to whether it's actually more efficient!
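A regex-based fix along those lines might look like the following. It is a sketch under assumptions, not necessarily the code the answer had in mind: it is written for Python 3 (where str always stores full code points, like a wide Python 2 build), and the names SURROGATE_PAIR and fix_surrogates are hypothetical.

```python
import re

# A high surrogate (U+D800..U+DBFF) immediately followed by a low surrogate
# (U+DC00..U+DFFF) forms one UTF-16 surrogate pair.
SURROGATE_PAIR = re.compile('[\ud800-\udbff][\udc00-\udfff]')


def _merge_pair(match):
    """Convert one matched surrogate pair into the code point it encodes."""
    high, low = match.group(0)
    code_point = 0x10000 + ((ord(high) - 0xD800) << 10) + (ord(low) - 0xDC00)
    return chr(code_point)


def fix_surrogates(text):
    """Replace every well-formed surrogate pair in text with a single character."""
    return SURROGATE_PAIR.sub(_merge_pair, text)


print(len(fix_surrogates('\ud834\udd0c')))  # prints: 1
```

Lone surrogates do not match the pattern and are left untouched, which is arguably the safest choice for malformed input.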
Unfortunately, the behavior of the CPython interpreter in versions earlier than 3.3 depends on whether it is built with "narrow" or "wide" Unicode support. So the same code, such as a call to len, can have a different result in different builds of the standard interpreter. See this question for examples.

The distinction between "narrow" and "wide" is that "narrow" interpreters internally store 16-bit code units (UCS-2), whereas "wide" interpreters internally store 32-bit code units (UCS-4). Code points U+10000 and above (outside the Basic Multilingual Plane) have a len of two on "narrow" interpreters because two UCS-2 code units are needed to represent them (using surrogates), and that's what len measures. On "wide" builds only a single UCS-4 code unit is required for a non-BMP code point, so for those builds len is one for such code points.

I have confirmed that the below handles all unicode strings whether or not they contain surrogate pairs, and works in CPython 2.7 in both narrow and wide builds. (Arguably, specifying a string like u'\ud83d\udc4d' in a wide interpreter reflects an affirmative desire to represent a complete surrogate code point, as distinct from a partial-character code unit, and is therefore not automatically an error to be corrected, but I'm ignoring that here. It's an edge case and normally not a desired use case.)

The @invoke trick used below is a way to avoid repeat computation without adding anything to the module's __dict__.
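As an illustration of counting in a build-independent way, here is a minimal sketch that counts code points by treating a high surrogate followed by a low surrogate as a single character. It is not the answer's own implementation and does not use the @invoke trick; the function name code_point_count is hypothetical, and the example is written for Python 3, although the loop itself relies only on len, ord, and indexing.

```python
def code_point_count(text):
    """Count code points, treating each surrogate pair as one character."""
    count = 0
    i = 0
    n = len(text)
    while i < n:
        unit = ord(text[i])
        # A high surrogate directly followed by a low surrogate is one
        # code point stored as two units, so skip both at once.
        if (0xD800 <= unit <= 0xDBFF and i + 1 < n
                and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF):
            i += 2
        else:
            i += 1
        count += 1
    return count


print(code_point_count('\ud83d\udc4d'))  # prints: 1
print(code_point_count('\U0001f44d'))    # prints: 1
```

Because the same string yields the same count whether a non-BMP character is stored as one unit or as a surrogate pair, the result should match before and after a round trip through UTF-8 storage. Lone surrogates are counted as one character each.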