Unicode::Normalize - a question about "Normalization Form"
#!/usr/local/bin/perl
use warnings;
use 5.014;
use Unicode::Normalize qw(NFD NFC compose);
my $string1 = "\x{f5}";
my $NFD_string1 = NFD( $string1 );
# PV = 0x831150 "o\314\203"\0 [UTF8 "o\x{303}"] *
my $composed_NFD_string1 = compose( $NFD_string1 );
# PV = 0x77bc40 "\303\265"\0 [UTF8 "\x{f5}"] *
my $NFC_string1 = NFC( $string1 );
# PV = 0x836e30 "\303\265"\0 [UTF8 "\x{f5}"] *
my $string2 = "o\x{303}";
my $NFD_string2 = NFD( $string2 );
# PV = 0x780da0 "o\314\203"\0 [UTF8 "o\x{303}"] *
my $composed_NFD_string2 = compose( $NFD_string2 );
# PV = 0x782dc0 "\303\265"\0 [UTF8 "\x{f5}"] *
my $NFC_string2 = NFC( $string2 );
# PV = 0x7acba0 "\303\265"\0 [UTF8 "\x{f5}"] *
# * from Devel::Peek::Dump output
say 'OK' if $NFD_string1 eq $NFD_string2;
say 'OK' if $NFC_string1 eq $NFC_string2;
Output:
OK
OK
After trying this, I asked myself: is there a reason to use Normalization Form D instead of Normalization Form C?
Comments (1)
Not everything has a composed form, and NFC actually does an NFD first. Part of NFD is putting the combining characters into a fixed order after the starter character so you can compare two grapheme clusters (the fancy name for a starter along with its combining characters) to see if they are the same. For what you are doing in this example, you should get the same answers either way, but NFC actually does more work.
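To make the "not everything has a composed form" point concrete, here is a minimal sketch that compares string lengths (in code points) after each normalization. The sample strings and the `%len` hash are my own choices for illustration; the only library calls are `NFD` and `NFC` from Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC);

# "o" + U+0303 has a precomposed code point (U+00F5);
# "q" + U+0303 has none, so NFC has nothing to compose it into.
my $o_tilde = "o\x{303}";
my $q_tilde = "q\x{303}";

# Length in code points after each normalization.
my %len = (
    o_nfd => length( NFD($o_tilde) ),
    o_nfc => length( NFC($o_tilde) ),
    q_nfd => length( NFD($q_tilde) ),
    q_nfc => length( NFC($q_tilde) ),
);

printf "%s: %d\n", $_, $len{$_} for sort keys %len;
```

Only the o-with-tilde case shrinks to a single code point under NFC; the q-with-tilde string stays at two code points in both forms.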
There are a couple of reasons that some things don't have a special NFC version. Many of the precomposed characters that do exist came from historical character sets: the composed version of é is there to make the Latin-1 people happy. There are also the separate e and ´ versions, designed to allow you to build the grapheme on your own. There are many ways to do that, and it's not just accents and diacriticals. Grapheme clusters can have several combining characters, and as you build them yourself, you can put them in any order you like (for whatever reason). However, each one has an assigned weight (its canonical combining class). NFD will reorder them by their weights so you can compare two grapheme clusters regardless of the order you used.
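The reordering is easy to see with two marks of different weights. This is only a sketch: the sample strings and the `code_points` helper are mine, while `NFD` and `getCombinClass` come from Unicode::Normalize.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD getCombinClass);

# Two spellings of the same grapheme cluster: "a" with an acute accent
# above (U+0301) and a dot below (U+0323), typed in opposite orders.
my $acute_first = "a\x{301}\x{323}";
my $dot_first   = "a\x{323}\x{301}";

# The "weights" are the canonical combining classes of the marks.
printf "U+%04X has combining class %d\n", $_, getCombinClass($_)
    for 0x301, 0x323;

# Hypothetical helper: render a string as a list of code points.
sub code_points { join ' ', map { sprintf 'U+%04X', ord } split //, shift }

print "raw equal: ", ( $acute_first eq $dot_first ? "yes" : "no" ), "\n";
print "NFD equal: ", ( NFD($acute_first) eq NFD($dot_first) ? "yes" : "no" ), "\n";
print "NFD order: ", code_points( NFD($acute_first) ), "\n";
```

The raw strings differ, but both normalize to U+0061 U+0323 U+0301, because the dot below (class 220) sorts ahead of the acute accent (class 230). Marks that share the same combining class are not swapped, so their typed order still matters.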
It's all in Unicode Technical Report 15 (now published as Unicode Standard Annex #15, "Unicode Normalization Forms"), just as daxim said in the comment. You'll want to see the diagrams and read around the part that says:
Some things explicitly use NFD for their data, such as the HFS+ file system. That doesn't much matter in many cases because your programming language probably binds to library functions that transform your filename strings into the right form.
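When you do end up comparing names yourself, one defensive option is to normalize both sides first. This is a rough sketch under my own assumptions: `same_name` is a hypothetical helper, the listing is a hard-coded stand-in for names read from a directory, and all strings are assumed to be already decoded into Perl character strings.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: compare two already-decoded names by their NFD form.
sub same_name {
    my ( $left, $right ) = @_;
    return NFD($left) eq NFD($right);
}

# Stand-ins for names from a directory listing (decomposed, as a file
# system that stores NFD might return them) and a name typed by a user.
my @listing = ( "cafe\x{301}.txt", "notes.txt" );
my $wanted  = "caf\x{e9}.txt";

my @matches = grep { same_name( $_, $wanted ) } @listing;
print scalar(@matches), " match(es)\n";
```

A plain `eq` would miss the match here, since the listing holds the decomposed spelling and the user typed the precomposed one.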
Sometime later today I'll be uploading Unicode::Support which demonstrates many of these things.
And, later today, Tom will come along and school us all. :)