Why does string.Compare seem to handle accented characters inconsistently?

Posted on 2024-08-03 06:33:09

If I execute the following statement:

string.Compare("mun", "mün", true, CultureInfo.InvariantCulture)

The result is '-1', indicating that 'mun' has a lower numeric value than 'mün'.

However, if I execute this statement:

string.Compare("Muntelier, Schweiz", "München, Deutschland", true, CultureInfo.InvariantCulture)

I get '1', indicating that 'Muntelier, Schweiz' should go last.

Is this a bug in the comparison? Or, more likely, is there a rule I should be taking into account when sorting strings containing accented characters?


The reason this is an issue is that I'm sorting a list and then running a manual binary filter that's meant to get every string beginning with 'xxx'.

Previously I was using the LINQ 'Where' method, but now I have to use this custom function written by another person, because he says it performs better.

But the custom function doesn't seem to take into account whatever Unicode rules .NET applies. So if I tell it to filter by 'mün', it doesn't find any items, even though there are items in the list beginning with 'mun'.

This seems to be because of the inconsistent ordering of accented characters, depending on what characters go after the accented character.


OK, I think I've fixed the problem.

Before the filter, I do a sort based on the first n letters of each string, where n is the length of the search string.
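One way to keep the sort and the binary filter in agreement is to use ordinal comparison for both, since in ordinal order every string sharing a prefix sits in one contiguous run. The following is only a sketch under assumptions: `PrefixFilter` and `StartingWith` are hypothetical names (the original custom function is not shown), and ordinal matching is case-sensitive and accent-sensitive, so 'mün' will not match 'mun'.

```csharp
using System;
using System.Collections.Generic;

static class PrefixFilter
{
    // Hypothetical helper: returns every item that starts with `prefix`.
    // Assumes `sortedOrdinal` was sorted with StringComparer.Ordinal.
    public static List<string> StartingWith(List<string> sortedOrdinal, string prefix)
    {
        // Binary search for the first item that is >= prefix in ordinal order.
        int lo = 0, hi = sortedOrdinal.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (string.CompareOrdinal(sortedOrdinal[mid], prefix) < 0)
                lo = mid + 1;
            else
                hi = mid;
        }

        // In ordinal order, all items sharing the prefix follow contiguously.
        var result = new List<string>();
        for (int i = lo; i < sortedOrdinal.Count
             && sortedOrdinal[i].StartsWith(prefix, StringComparison.Ordinal); i++)
        {
            result.Add(sortedOrdinal[i]);
        }
        return result;
    }

    static void Main()
    {
        var items = new List<string>
        {
            "Muntelier, Schweiz", "München, Deutschland", "Münster", "Mumbai"
        };
        items.Sort(StringComparer.Ordinal); // same ordering rule as the search

        foreach (var hit in StartingWith(items, "Mün"))
            Console.WriteLine(hit);
    }
}
```

The design point is that the comparer used for sorting and the one used for searching must be the same. A culture-sensitive sort combined with a character-by-character search is what lets accented entries end up outside the range the search inspects.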

Comments (3)

垂暮老矣 2024-08-10 06:33:09

There is a tie-breaking algorithm at work, see http://unicode.org/reports/tr10/

To address the complexities of language-sensitive sorting, a multilevel comparison algorithm is employed. In comparing two words, for example, the most important feature is the base character: such as the difference between an A and a B. Accent differences are typically ignored, if there are any differences in the base letters. Case differences (uppercase versus lowercase), are typically ignored, if there are any differences in the base or accents. Punctuation is variable. In some situations a punctuation character is treated like a base character. In other situations, it should be ignored if there are any base, accent, or case differences. There may also be a final, tie-breaking level, whereby if there are no other differences at all in the string, the (normalized) code point order is used.

So, "Munt..." and "Münc..." are alphabetically different and sort based on the "t" and "c".

Whereas "mun" and "mün" are alphabetically the same ("u" is equivalent to "ü" in most languages), so the character codes are compared.
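The two levels can be observed separately in .NET by switching accent sensitivity off with CompareOptions.IgnoreNonSpace. This is a rough sketch rather than a description of the framework's internals; `CollationLevels`, `BaseLettersOnly` and `WithAccents` are names made up for the illustration.

```csharp
using System;
using System.Globalization;

static class CollationLevels
{
    static readonly CompareInfo Invariant = CultureInfo.InvariantCulture.CompareInfo;

    // Sign of a comparison that looks at base letters only
    // (case and accents are both ignored).
    public static int BaseLettersOnly(string x, string y) =>
        Math.Sign(Invariant.Compare(x, y,
            CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace));

    // Sign of the comparison used in the question:
    // case-insensitive but still accent-sensitive.
    public static int WithAccents(string x, string y) =>
        Math.Sign(Invariant.Compare(x, y, CompareOptions.IgnoreCase));

    static void Main()
    {
        // Base letters tie, so only the accent can decide the order.
        Console.WriteLine(BaseLettersOnly("mun", "mün"));
        Console.WriteLine(WithAccents("mun", "mün"));

        // Base letters already differ ('t' vs 'c'), so the accent never matters.
        Console.WriteLine(BaseLettersOnly("Muntelier, Schweiz", "München, Deutschland"));
        Console.WriteLine(WithAccents("Muntelier, Schweiz", "München, Deutschland"));
    }
}
```

With the usual culture-aware globalization support, the first pair should tie on base letters and be ordered only by the accent, while the second pair should get the same sign from both methods.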

酸甜透明夹心 2024-08-10 06:33:09

It looks like the accented character is only being used in a sort of "tie-break" situation - in other words, if the strings are otherwise equal.

Here's some sample code to demonstrate:

using System;
using System.Globalization;

class Test
{
    static void Main()
    {
        Compare("mun", "mün");   // differ only by the accent
        Compare("muna", "münb"); // base letters differ: a vs b
        Compare("munb", "müna"); // base letters differ: b vs a
    }

    static void Compare(string x, string y)
    {
        // Case-insensitive comparison using the invariant culture's rules.
        int result = string.Compare(x, y, true,
                                    CultureInfo.InvariantCulture);

        Console.WriteLine("{0}; {1}; {2}", x, y, result);
    }
}

(I've tried adding a space after the "n" as well, to see if it was done on word boundaries - it isn't.)

Results:

mun; mün; -1
muna; münb; -1
munb; müna; 1

I suspect this is correct by various complicated Unicode rules - but I don't know enough about them.

As for whether you need to take this into account... I wouldn't expect so. What are you doing that gets thrown off by this?

小猫一只 2024-08-10 06:33:09

As I understand it, this is still somewhat consistent. When comparing with CultureInfo.InvariantCulture, the umlaut character ü is treated like the unaccented character u.

Because the strings in your first example are obviously not equal, the result is not 0 but -1 (which seems to be a default value). In the second example, Muntelier goes last because t follows c in the alphabet.

I couldn't find any clear documentation in MSDN explaining these rules, but I found that

string.Compare("mun", "mün", CultureInfo.InvariantCulture,  
    CompareOptions.StringSort);

and

string.Compare("Muntelier, Schweiz", "München, Deutschland", 
    CultureInfo.InvariantCulture, CompareOptions.StringSort);

give the desired result.

Anyway, I think you'd be better off basing your sorting on a specific culture, such as the current user's culture (if possible).
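As an illustration of that last suggestion, a culture-specific, case-insensitive sort could look roughly like this. It is a sketch only: `CultureSort` and `SortFor` are hypothetical names, and CultureInfo.CurrentCulture stands in for whichever culture the application actually targets.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class CultureSort
{
    // Returns a sorted copy of `items` using the given culture's
    // case-insensitive ordering rules.
    public static List<string> SortFor(CultureInfo culture, IEnumerable<string> items)
    {
        var sorted = new List<string>(items);
        sorted.Sort(StringComparer.Create(culture, ignoreCase: true));
        return sorted;
    }

    static void Main()
    {
        var cities = new[] { "Muntelier, Schweiz", "München, Deutschland" };

        // Assumption: the current user's culture is the one we want to sort by.
        foreach (var city in SortFor(CultureInfo.CurrentCulture, cities))
            Console.WriteLine(city);
    }
}
```

Any later search over the sorted list would need to use the same comparer, otherwise the ordering and the lookup can disagree again.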
