Unicode::Normalize - Query about "Normalization Form"

Posted on 2024-11-25 12:04:41

#!/usr/local/bin/perl
use warnings;
use 5.014;
use Unicode::Normalize qw(NFD NFC compose);


my $string1 = "\x{f5}";

my $NFD_string1 = NFD( $string1 ); 
# PV = 0x831150 "o\314\203"\0 [UTF8 "o\x{303}"] *

my $composed_NFD_string1 = compose( $NFD_string1 ); 
#  PV = 0x77bc40 "\303\265"\0 [UTF8 "\x{f5}"] *

my $NFC_string1 = NFC( $string1 );
#  PV = 0x836e30 "\303\265"\0 [UTF8 "\x{f5}"] *


my $string2 = "o\x{303}";

my $NFD_string2 = NFD( $string2 );
#  PV = 0x780da0 "o\314\203"\0 [UTF8 "o\x{303}"] *

my $composed_NFD_string2 = compose( $NFD_string2 ); 
#  PV = 0x782dc0 "\303\265"\0 [UTF8 "\x{f5}"] *  

my $NFC_string2 = NFC( $string2 );
#  PV = 0x7acba0 "\303\265"\0 [UTF8 "\x{f5}"] * 

# * from Devel::Peek::Dump output


say 'OK' if $NFD_string1 eq $NFD_string2;
say 'OK' if $NFC_string1 eq $NFC_string2;

Output:

OK
OK

After trying this I asked myself:
Is there a reason to use Normalization Form D instead of Normalization Form C?

Comments (1)

甜味超标? 2024-12-02 12:04:41

Not everything has a composed form, and NFC actually does an NFD first. Part of NFD is putting the continuation characters into a fixed order after the starter character, so you can compare two grapheme clusters (the fancy name for a starter along with its continuation characters) to see whether they are the same. For what you are doing in this example you should get the same answers either way, but NFC does more work, because it decomposes and reorders first and only then recomposes.
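As a rough illustration of the point above, here is a minimal sketch using the `decompose`, `reorder` and `compose` functions that Unicode::Normalize exports on request. The choice of "q" plus a combining tilde as a sequence with no precomposed character is my own example, not something from the question.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC decompose reorder compose);

# "o" + COMBINING TILDE has the precomposed form U+00F5,
# so NFC yields a single code point.
my $with_composite = NFC("o\x{303}");

# "q" + COMBINING TILDE has no precomposed form,
# so NFC leaves the two code points as they are.
my $without_composite = NFC("q\x{303}");

# NFC spelled out by hand: decompose, reorder, then compose.
my $by_hand = compose( reorder( decompose("q\x{303}") ) );

printf "%d %d %s\n",
    length $with_composite,
    length $without_composite,
    ( $by_hand eq $without_composite ? 'same' : 'different' );
# prints "1 2 same"
```

So NFC cannot be relied on to give one code point per grapheme cluster. It only composes where Unicode defines a composed character.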

There are a couple of reasons that some things don't have a special NFC version. Many of the composed characters came from historical character sets: the composed version of é is there to make the Latin-1 people happy. There are also the separate e and ´ versions, designed to allow you to build the grapheme on your own. There are many ways to do that, and it's not just accents and diacriticals. A grapheme cluster can have several of those continuation characters, and as you build it yourself, you can put them in any order you like (for whatever reason). However, each of them has an assigned weight (its canonical combining class). NFD reorders them by those weights, so you can compare two grapheme clusters regardless of the order you typed the marks in.
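The reordering can be seen with a small sketch. It assumes an "a" carrying both a combining acute (U+0301) and a combining dot below (U+0323), a pair picked purely for illustration, and uses `getCombinClass` from Unicode::Normalize to show the weights.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD getCombinClass);

# The same two marks on "a", typed in opposite orders.
my $acute_first = "a\x{301}\x{323}";
my $dot_first   = "a\x{323}\x{301}";

# Show the weight (canonical combining class) of each mark.
printf "U+%04X ccc=%d\n", $_, getCombinClass($_) for 0x301, 0x323;
# prints "U+0301 ccc=230" and "U+0323 ccc=220"

# The raw strings differ, but NFD sorts the marks by weight,
# so the normalized strings compare equal.
my $raw_equal = $acute_first eq $dot_first           ? 1 : 0;
my $nfd_equal = NFD($acute_first) eq NFD($dot_first) ? 1 : 0;

print "raw equal: $raw_equal, NFD equal: $nfd_equal\n";
# prints "raw equal: 0, NFD equal: 1"
```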

It's all in Unicode Technical Report 15 (now published as Unicode Standard Annex #15), just as daxim said in the comment. You'll want to see the diagrams and read around the part that says:

Once a string has been fully decomposed, any sequences of combining marks that it contains are put into a well-defined order. This rearrangement of combining marks is done according to a subpart of the Unicode Normalization Algorithm known as the Canonical Ordering Algorithm. That algorithm sorts sequences of combining marks based on the value of their Canonical_Combining_Class (ccc) property, whose values are also defined in UnicodeData.txt. Most characters (including all non-combining marks) have a Canonical_Combining_Class value of zero, and are unaffected by the Canonical Ordering Algorithm. Such characters are referred to by a special term, starter. Only the subset of combining marks which have non-zero Canonical_Combining_Class property values are subject to potential reordering by the Canonical Ordering Algorithm. Those characters are called non-starters.
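The starter and non-starter distinction in that passage can be checked directly. The following is a sketch only: `is_starter` is a hypothetical helper name of mine, built on `getCombinClass`. It also shows a consequence that is easy to miss: marks with the same combining class are not reordered relative to each other, so their typed order stays significant.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD getCombinClass);

# Hypothetical helper: a starter is a code point whose ccc is zero.
sub is_starter {
    my ($code_point) = @_;
    return getCombinClass($code_point) == 0;
}

# Classify each code point of "a" + grave (U+0300) + acute (U+0301).
my @report = map {
    sprintf 'U+%04X %s', $_, is_starter($_) ? 'starter' : 'non-starter'
} map { ord } split //, "a\x{300}\x{301}";
print "$_\n" for @report;

# Grave and acute share the same class (230), so canonical ordering
# leaves them alone and the two sequences stay distinct under NFD.
my $same_class_equal =
    NFD("a\x{300}\x{301}") eq NFD("a\x{301}\x{300}") ? 1 : 0;
print "same-class marks equal after NFD: $same_class_equal\n";
```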

Some things explicitly use NFD for their data, such as the HFS+ file system. That doesn't much matter in many cases, because your programming language probably binds to library functions that transform your filename strings into the right form.
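When you do have to compare filenames yourself, one defensive option is to bring both sides to the same form first. This is a minimal sketch under stated assumptions: the names arrive as UTF-8 encoded bytes, `same_filename` is a hypothetical helper, and the two sample names are made up. It does not touch any real file system.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize qw(NFC);

# Hypothetical helper: decode both byte strings and compare them in
# one normalization form (NFC here; NFD would serve equally well).
sub same_filename {
    my ( $bytes_a, $bytes_b ) = @_;
    return NFC( decode( 'UTF-8', $bytes_a ) ) eq
           NFC( decode( 'UTF-8', $bytes_b ) );
}

# Made-up sample names as UTF-8 bytes.
my $typed     = "caf\xC3\xA9.txt";     # precomposed U+00E9
my $from_disk = "cafe\xCC\x81.txt";    # "e" followed by U+0301

print "bytes equal: ", ( $typed eq $from_disk ? 'yes' : 'no' ), "\n";
print "names equal: ",
    ( same_filename( $typed, $from_disk ) ? 'yes' : 'no' ), "\n";
```

Which form you pick matters less than using the same one on both sides of the comparison.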

Sometime later today I'll be uploading Unicode::Support, which demonstrates many of these things.

And, later today, Tom will come along and school us all. :)
