What is normalized UTF-8 all about?

Posted 2024-12-12 14:36:31

The ICU project (which also now has a PHP library) contains the classes needed to help normalize UTF-8 strings to make it easier to compare values when searching.

However, I'm trying to figure out what this means for applications. For example, in which cases do I want "Canonical Equivalence" instead of "Compatibility equivalence", or vice versa?
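For concreteness, here is a minimal sketch of the kind of normalization the question is about, using PHP's intl extension (which wraps ICU); the strings are arbitrary examples:

    <?php
    // "é" as a single code point vs. "e" plus a combining acute accent.
    $composed   = "\u{00E9}";   // U+00E9
    $decomposed = "e\u{0301}";  // U+0065 U+0301

    var_dump($composed === $decomposed); // false: the byte sequences differ

    // After normalizing both to the same form, they compare equal.
    var_dump(Normalizer::normalize($composed, Normalizer::FORM_C)
         === Normalizer::normalize($decomposed, Normalizer::FORM_C)); // true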

Comments (7)

幻想少年梦 2024-12-19 14:36:31

Everything You Never Wanted to Know about Unicode Normalization

Canonical Normalization

Unicode includes multiple ways to encode some characters, most notably accented characters. Canonical normalization changes the code points into a canonical encoding form. The resulting code points should appear identical to the original ones barring any bugs in the fonts or rendering engine.

When To Use

Because the results appear identical, it is always safe to apply canonical normalization to a string before storing or displaying it, as long as you can tolerate the result not being bit for bit identical to the input.

Canonical normalization comes in 2 forms: NFD and NFC. The two are equivalent in the sense that one can convert between these two forms without loss. Comparing two strings under NFC will always give the same result as comparing them under NFD.

NFD

NFD has the characters fully expanded out. This is the faster normalization form to calculate, but it results in more code points (i.e. uses more space).

If you just want to compare two strings that are not already normalized, this is the preferred normalization form unless you know you need compatibility normalization.

NFC

NFC recombines code points when possible after running the NFD algorithm. This takes a little longer, but results in shorter strings.
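As a rough sketch of the trade-off (PHP's intl Normalizer assumed; the sample string is arbitrary):

    <?php
    $s = "Ångström";

    $nfd = Normalizer::normalize($s, Normalizer::FORM_D); // fully decomposed
    $nfc = Normalizer::normalize($s, Normalizer::FORM_C); // recombined

    // NFD carries more code points than NFC for the same text.
    var_dump(mb_strlen($nfd, 'UTF-8')); // 10 (combining marks split out)
    var_dump(mb_strlen($nfc, 'UTF-8')); // 8

    // Comparing under NFD agrees with comparing under NFC.
    var_dump($nfd === Normalizer::normalize($nfc, Normalizer::FORM_D)); // true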

Compatibility Normalization

Unicode also includes many characters that really do not belong, but were used in legacy character sets. Unicode added these to allow text in those character sets to be processed as Unicode, and then be converted back without loss.

Compatibility normalization converts these to the corresponding sequence of "real" characters, and also performs canonical normalization. The results of compatibility normalization may not appear identical to the originals.

Characters that include formatting information are replaced with ones that do not. For example the character ⁹ gets converted to 9. Others don't involve formatting differences. For example the roman numeral character Ⅸ is converted to the regular letters IX.

Obviously, once this transformation has been performed, it is no longer possible to losslessly convert back to the original character set.
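With PHP's intl Normalizer, for instance, the examples above look like this (a sketch):

    <?php
    // Compatibility normalization collapses formatting variants...
    echo Normalizer::normalize("\u{2079}", Normalizer::FORM_KC); // ⁹ becomes "9"
    echo Normalizer::normalize("\u{2168}", Normalizer::FORM_KC); // Ⅸ becomes "IX"

    // ...while canonical normalization leaves them untouched.
    echo Normalizer::normalize("\u{2079}", Normalizer::FORM_C);  // still ⁹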

When to use

The Unicode Consortium suggests thinking of compatibility normalization like a ToUpperCase transform. It is something that may be useful in some circumstances, but you should not just apply it willy-nilly.

An excellent use case would be a search engine, since you would probably want a search for 9 to match ⁹.

One thing you should probably not do is display the result of applying compatibility normalization to the user.
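A sketch that keeps those two pieces of advice apart: normalize aggressively for matching, keep the original (or NFC) text for display. The helper name normalizeForSearch is invented for illustration:

    <?php
    // Hypothetical indexing helper: NFKC plus case folding, for matching only.
    function normalizeForSearch(string $text): string
    {
        $n = Normalizer::normalize($text, Normalizer::FORM_KC);
        return mb_strtolower($n, 'UTF-8');
    }

    var_dump(normalizeForSearch("\u{2079}") === normalizeForSearch("9")); // true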

NFKC/NFKD

Compatibility normalization comes in two forms, NFKD and NFKC. They have the same relationship to each other as NFD and NFC do.

Any string in NFKC is inherently also in NFC, and the same holds for NFKD and NFD. Thus NFKD(x) = NFD(NFKC(x)), and NFKC(x) = NFC(NFKD(x)), etc.
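Those identities are easy to spot-check (a sketch; any test string will do):

    <?php
    $x = "\u{2168}\u{00E9}"; // Ⅸ followed by é, an arbitrary test input

    $nfkd = Normalizer::normalize($x, Normalizer::FORM_KD);
    $nfkc = Normalizer::normalize($x, Normalizer::FORM_KC);

    var_dump($nfkd === Normalizer::normalize($nfkc, Normalizer::FORM_D)); // NFKD(x) = NFD(NFKC(x))
    var_dump($nfkc === Normalizer::normalize($nfkd, Normalizer::FORM_C)); // NFKC(x) = NFC(NFKD(x))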

Conclusion

If in doubt, go with canonical normalization. Choose NFC or NFD based on the space/speed trade-off applicable, or based on what is required by something you are inter-operating with.

走过海棠暮 2024-12-19 14:36:31

Some characters, for example a letter with an accent (say, é) can be represented in two ways - a single code point U+00E9 or the plain letter followed by a combining accent mark U+0065 U+0301. Ordinary normalization will choose one of these to always represent it (the single code point for NFC, the combining form for NFD).

For characters that could be represented by multiple sequences of base characters and combining marks (say, "s, dot below, dot above" vs putting dot above then dot below, or using a base character that already has one of the dots), NFD will also pick one of these (below goes first, as it happens).
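That ordering can be checked directly; U+1E69 (LATIN SMALL LETTER S WITH DOT BELOW AND DOT ABOVE) decomposes with the dot below first (a PHP sketch):

    <?php
    // NFD sorts combining marks by canonical combining class:
    // dot below (U+0323, class 220) precedes dot above (U+0307, class 230).
    $nfd = Normalizer::normalize("\u{1E69}", Normalizer::FORM_D);
    echo bin2hex($nfd); // 73cca3cc87 -> s, U+0323, U+0307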

The compatibility decompositions include a number of characters that "shouldn't really" be characters but are because they were used in legacy encodings. Ordinary normalization won't unify these (to preserve round-trip integrity - this isn't an issue for the combining forms because no legacy encoding [except a handful of Vietnamese encodings] used both), but compatibility normalization will. Think of the "kg" kilogram sign that appears in some East Asian encodings (or the halfwidth/fullwidth katakana and alphabet), or the "fi" ligature in MacRoman.

See http://unicode.org/reports/tr15/ for more details.

江挽川 2024-12-19 14:36:31

Normal forms (of Unicode, not databases) deal primarily (exclusively?) with characters that have diacritical marks. Unicode provides some characters with "built in" diacritical marks, such as U+00C0, "Latin Capital Letter A with Grave". The same character can be created from a "Latin Capital Letter A" (U+0041) with a "Combining Grave Accent" (U+0300). That means even though the two sequences produce the same resulting character, a byte-by-byte comparison will show them as being completely different.

Normalization is an attempt at dealing with that. Normalizing ensures (or at least tries to ensure) that all the characters are encoded the same way -- either all using a separate combining diacritical mark where needed, or all using a single code point wherever possible. From a viewpoint of comparison, it doesn't really matter a whole lot which you choose -- pretty much any normalized string will compare properly with another normalized string.

In this case, "compatibility" means compatibility with code that assumes that one code point equals one character. If you have code like that, you probably want to use the compatibility normal form. Although I've never seen it stated directly, the names of the normal forms imply that the Unicode consortium considers it preferable to use separate combining diacritical marks. This requires more intelligence to count the actual characters in a string (as well as things like breaking a string intelligently), but is more versatile.

If you're making full use of ICU, chances are that you want to use the canonical normal form. If you're trying to write code on your own that (for example) assumes a code point equals a character, then you probably want the compatibility normal form that makes that true as often as possible.
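A sketch of that distinction in PHP (intl's grapheme functions versus code-point counting; the sample string is arbitrary):

    <?php
    $decomposed = "e\u{0301}"; // "é" as e plus combining acute

    // Code-point counting sees two units in the decomposed form...
    var_dump(mb_strlen($decomposed, 'UTF-8')); // 2

    // ...NFC packs it back into one code point where possible...
    $nfc = Normalizer::normalize($decomposed, Normalizer::FORM_C);
    var_dump(mb_strlen($nfc, 'UTF-8')); // 1

    // ...and grapheme-aware counting reports one character either way.
    var_dump(grapheme_strlen($decomposed)); // 1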

简单 2024-12-19 14:36:31

If two Unicode strings are canonically equivalent, the strings are really the same, only using different Unicode sequences. For example, Ä can be represented either using the character Ä or a combination of A and ◌̈.

If the strings are only compatibility equivalent, the strings aren't necessarily the same, but they may be the same in some contexts. E.g. the ligature ﬀ could be considered the same as ff.
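In PHP terms (a sketch; U+FB00 is the ﬀ ligature):

    <?php
    $ligature = "\u{FB00}"; // LATIN SMALL LIGATURE FF

    // Canonical normalization keeps the ligature; compatibility expands it.
    var_dump(Normalizer::normalize($ligature, Normalizer::FORM_C)  === "ff"); // false
    var_dump(Normalizer::normalize($ligature, Normalizer::FORM_KC) === "ff"); // true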

So, if you are comparing strings you should use canonical equivalence, because compatibility equivalence isn't real equivalence.

But if you want to sort a set of strings it might make sense to use compatibility equivalence, as they are nearly identical.

浪推晚风 2024-12-19 14:36:31

This is actually fairly simple. UTF-8 has several different representations of the same "character". (I use "character" in quotes since byte-wise they are different, but practically they are the same.) An example is given in the linked document.

The character "Ç" can be represented as the byte sequence 0xc387. But it can also be represented by a C (0x43) followed by the byte sequence 0xcca7. So you can say that 0xc387 and 0x43cca7 are the same character. The reason that works, is that 0xcca7 is a combining mark; that is to say it takes the character before it (a C here), and modifies it.

Now, as far as the difference between canonical equivalence vs compatibility equivalence, we need to look at characters in general.

There are 2 types of characters, those that convey meaning through the value, and those that take another character and alter it. 9 is a meaningful character. A super-script ⁹ takes that meaning and alters it by presentation. So canonically they have different meanings, but they still represent the base character.

Canonical equivalence is where the byte sequence is rendering the same character with the same meaning. Compatibility equivalence is when the byte sequence is rendering a different character with the same base meaning (even though it may be altered). The 9 and ⁹ are compatibility equivalent since they both mean "9", but are not canonically equivalent since they don't have the same representation.

七颜 2024-12-19 14:36:31

Whether canonical equivalence or compatibility equivalence is more relevant to you depends on your application. The ASCII way of thinking about string comparisons roughly maps to canonical equivalence, but Unicode represents a lot of languages. I don't think it is safe to assume that Unicode encodes all languages in a way that allows you to treat them just like Western European ASCII.

Figures 1 and 2 provide good examples of the two types of equivalence. Under compatibility equivalence, it looks like the same number in sub- and super-script form would compare equal. But I'm not sure that solves the same problem as the cursive Arabic forms or the rotated characters.

The hard truth of Unicode text processing is that you have to think deeply about your application's text processing requirements, and then address them as well as you can with the available tools. That doesn't directly address your question, but a more detailed answer would require linguistic experts for each of the languages you expect to support.

十年不长 2024-12-19 14:36:31

The problem of comparing strings: two strings with content that is equivalent for the purposes of most applications may contain differing character sequences.

See Unicode's canonical equivalence: if the comparison algorithm is simple (or must be fast), Unicode equivalence is not performed. This problem occurs, for instance, in XML canonical comparison; see http://www.w3.org/TR/xml-c14n

To avoid this problem... What standard to use, "expanded UTF-8" or "compact UTF-8"?
Use "ç" or "c + ◌̧"?

W3C and others (e.g. for file names) suggest using the composed form as canonical (keep in mind the C of "most compact", i.e. shorter, strings)... So,

The standard is C! When in doubt, use NFC.

For interoperability, and for "convention over configuration" choices, the recommendation is to use NFC to "canonicalize" external strings. To store canonical XML, for example, store it in "FORM_C". The W3C's CSV on the Web Working Group also recommends NFC (section 7.2).

PS: de "FORM_C" is the default form in most of libraries. Ex. in PHP's normalizer.isnormalized().


The term "composition form" (FORM_C) is used both to say that "a string is in the C-canonical form" (the result of an NFC transformation) and to say that the transforming algorithm is used... See http://www.macchiato.com/unicode/nfc-faq

(...) each of the following sequences (the first two being single-character sequences) represent the same character:

  1. U+00C5 ( Å ) LATIN CAPITAL LETTER A WITH RING ABOVE
  2. U+212B ( Å ) ANGSTROM SIGN
  3. U+0041 ( A ) LATIN CAPITAL LETTER A + U+030A ( ̊ ) COMBINING RING ABOVE

These sequences are called canonically equivalent. The first of these forms is called NFC - for Normalization Form C, where the C is for composition.
(...) A function transforming a string S into the NFC form can be abbreviated as toNFC(S), while one that tests whether S is in NFC is abbreviated as isNFC(S).
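All three sequences above land on the same NFC form; for example, in PHP (a sketch of toNFC/isNFC in terms of the intl Normalizer):

    <?php
    $forms = [
        "\u{00C5}",  // LATIN CAPITAL LETTER A WITH RING ABOVE
        "\u{212B}",  // ANGSTROM SIGN
        "A\u{030A}", // A + COMBINING RING ABOVE
    ];

    foreach ($forms as $s) {
        echo bin2hex(Normalizer::normalize($s, Normalizer::FORM_C)), "\n"; // c385 every time
        var_dump(Normalizer::isNormalized($s, Normalizer::FORM_C)); // true only for the first
    }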


Note: to test normalization of short strings (pure UTF-8 or XML-entity references), you can use this test/normalize online converter.
