Is there Unicode Collation Algorithm (UCA) code for Delphi?
Collation under Unicode Technical Standard #10 (UCA), which is a separate matter from being Unicode compliant, covers not only ordering/sorting but also comparison: the question "is string 1 equal to string 2?". Sometimes code points that differ in value between the two strings are to be considered equal for collation and comparison purposes. At least that is implied by this blog post, which is written from a Perl standard library perspective.
What I want to know is: (a) does Delphi XE2 already fully implement the entire Unicode collation spec, and (b) if not, does a third-party library do so?
Sample code:
Str1 := Chr($212B);
Str2 := Chr($C5);
n := CompareStr(Str1, Str2); // nonzero in Delphi; under UCA rules the two strings compare equal, so this should be 0
According to the Unicode collation spec, Unicode collation should treat the code points above as equivalent under comparison. That makes no sense from a binary point of view, so I'm glad that neither CompareStr in Delphi nor cmp in Perl (from the linked article) is polluted with Unicode glitches. But what if you want to do a Unicode-compliant collation in Delphi, like the Perl Unicode::Collation library? How?
Update: AnsiCompareStr would call the Win32 CompareString and would handle some locale-specific cases like the one above. From reading around the internet, the classic Windows Unicode collation behaviour and UCA are converging slowly but not completely, with UCA seeming to be the one that gets changed to make it more like Windows collation.
Comments (1)
(a) No. Delphi's AnsiCompareStr and co. wrap the Win32 CompareString function, which does not follow the Unicode collation algorithm.
(b) The ICU project does support it, but the Delphi wrapper, ICU4PAS, hasn't been updated since 2007.
That may not be necessary for you, though. You're seeing this behavior because you're using CompareStr instead of AnsiCompareStr. The non-ANSI version is written in asm in SysUtils, compares char by char, and doesn't take equivalence or combining characters into account. The case-insensitive version, CompareText, also only folds case for a-z. The ANSI versions call CompareString internally, which is locale-aware and does handle all of those cases.
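To see the difference described above, a minimal console sketch could print both results side by side. It assumes Delphi XE2 or later on Windows (hence the System.SysUtils unit scope name); the program name CompareDemo is made up for illustration, and what AnsiCompareStr returns depends on the Windows version and the user's locale, so no output is claimed here.

```pascal
program CompareDemo; // hypothetical name, for illustration only

{$APPTYPE CONSOLE}

uses
  System.SysUtils; // assumption: XE2+ unit scope names; older versions use SysUtils

var
  Str1, Str2: string;
begin
  Str1 := Chr($212B); // U+212B ANGSTROM SIGN
  Str2 := Chr($C5);   // U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE

  // Ordinal comparison: looks only at the UTF-16 code unit values.
  Writeln('CompareStr:     ', CompareStr(Str1, Str2));

  // Locale-aware comparison: goes through the Win32 CompareString function.
  Writeln('AnsiCompareStr: ', AnsiCompareStr(Str1, Str2));
end.
```

If AnsiCompareStr prints 0 while CompareStr does not, the locale-aware routine is treating the two code points as equal on that system, which may be all the question's use case needs without a full UCA implementation.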
Note that this is only true for the routines in SysUtils. In StrUtils.pas the non-ANSI versions are just inline wrappers around the ANSI ones, so they are all locale-aware.