Canonical form of a Unicode string
I have a Unicode string encoded as, say, UTF-8. A single Unicode string can have several byte representations. Is there a canonical (normalized) form of a Unicode string, or can one be created, so that such strings can be compared with memcmp(3) and the like? Can ICU or any other C/C++ library do that?
Comments (3)
You might be looking for Unicode normalisation. There are four normal forms (NFC, NFD, NFKC and NFKD), and each ensures that all equivalent strings end up in one common form. In many cases, however, you also need to take locale into account. So while normalisation gives you a cheap byte-for-byte comparison (provided both strings use the same Unicode transformation format, such as UTF-8 or UTF-16, and the same normal form), it won't gain you much beyond that limited use case.
Comparing Unicode code point sequences:
UTF-8 is itself a canonical representation. Two Unicode strings composed of the same code points are always encoded to exactly the same UTF-8 byte sequence and can therefore be compared with memcmp. This is a necessary property of UTF-8; otherwise it would not be easily decodable. The same holds for all the official Unicode encoding schemes (UTF-8, UTF-16 and UTF-32): they encode a given string to different byte sequences from one another, but each scheme always encodes the same string to the same sequence. With endianness and platform independence in mind, UTF-8 is the recommended scheme because you don't have to deal with byte order when reading or writing 16-bit or 32-bit values. So the answer is that if two strings are encoded with the same encoding scheme (e.g. UTF-8) and the same endianness (not an issue with UTF-8), the resulting byte sequences will be the same.
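To illustrate the point about UTF-8 being deterministic, here is a minimal sketch in C. The helper names utf8_encode and utf8_bytes_equal are made up for this example and do not come from any library. The encoder maps each code point to exactly one byte sequence, which is why equal code point sequences can be compared with memcmp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: encodes one code point as UTF-8 into out[4].
   Returns the number of bytes written, or 0 for an invalid code point.
   Each valid code point has exactly one (shortest) encoding. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* surrogates cannot be encoded in UTF-8 */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0; /* beyond the Unicode range */
}

/* Hypothetical helper: true when two NUL-terminated UTF-8 strings are
   byte-for-byte identical. This tests code point equality only; it does
   not account for canonical equivalence. */
static bool utf8_bytes_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    size_t len_a = strlen(a);
    size_t len_b = strlen(b);
    return len_a == len_b && memcmp(a, b, len_a) == 0;
}
```

For example, utf8_encode(0xE9, buf) writes the two bytes C3 A9, and any two strings built from the same code points will pass utf8_bytes_equal.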
Comparing Unicode strings:
There is another issue that is harder to handle. In Unicode, some glyphs (the characters you see on screen or paper) can be represented either by a single code point or by a sequence of consecutive code points: a base character followed by one or more combining characters. This is usually the case for glyphs with accents, diacritical marks, etc. Because the code point representations differ, the corresponding byte sequences differ too. Comparing strings while taking these combining characters into account cannot be done with a simple byte comparison; you have to normalize the strings first.
The other answers mention Unicode normalization techniques, normal forms and libraries that you can use to convert Unicode strings to a normal form. Once normalized, the strings can be compared byte by byte in any encoding scheme, as long as both use the same one.
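A small sketch of why the byte comparison fails, assuming the letter "é" as the test case. The names E_ACUTE_COMPOSED, E_ACUTE_DECOMPOSED and hex_dump are invented for this illustration. Both strings render as the same glyph, yet their UTF-8 bytes differ.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* "é" as one precomposed code point: U+00E9 */
static const char E_ACUTE_COMPOSED[] = "\xC3\xA9";
/* "é" as base letter plus combining acute accent: U+0065 U+0301 */
static const char E_ACUTE_DECOMPOSED[] = "e\xCC\x81";

/* Hypothetical helper: writes the bytes of s as space-separated
   uppercase hex into out, stopping early if out is too small. */
static void hex_dump(const char *s, char *out, size_t out_size)
{
    size_t used = 0;
    if (out_size == 0)
        return;
    out[0] = '\0';
    for (size_t i = 0; s[i] != '\0'; i++) {
        int n = snprintf(out + used, out_size - used, "%s%02X",
                         i == 0 ? "" : " ", (unsigned char)s[i]);
        if (n < 0 || (size_t)n >= out_size - used)
            break; /* output truncated */
        used += (size_t)n;
    }
}
```

Dumping the two constants gives "C3 A9" and "65 CC 81", so memcmp reports them as different until both are brought to the same normal form.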
You're looking to normalize the string to one of the Unicode normalization forms. libicu can do this for you, but not directly on a UTF-8 string. You first have to convert it to UChar, using e.g. ucnv_toUChars, then normalize with unorm_normalize, then convert back using ucnv_fromUChars. I think there are also some specific versions of ucnv_* for UTF-8 encoding. If memcmp is your only goal, you can of course do that directly on the UChar array after unorm_normalize.