Why does string.Compare seem to handle accented characters inconsistently?
If I execute the following statement:
string.Compare("mun", "mün", true, CultureInfo.InvariantCulture)
The result is '-1', indicating that 'mun' has a lower numeric value than 'mün'.
However, if I execute this statement:
string.Compare("Muntelier, Schweiz", "München, Deutschland", true, CultureInfo.InvariantCulture)
I get '1', indicating that 'Muntelier, Schweiz' should go last.
Is this a bug in the comparison? Or, more likely, is there a rule I should be taking into account when sorting strings containing accented characters?
The reason this is an issue is, I'm sorting a list and then doing a manual binary filter that's meant to get every string beginning with 'xxx'.
Previously I was using the Linq 'Where' method, but now I have to use this custom function written by another person, because he says it performs better.
But the custom function doesn't seem to take into account whatever 'unicode' rules .NET has. So if I tell it to filter by 'mün', it doesn't find any items, even though there are items in the list beginning with 'mun'.
This seems to be because of the inconsistent ordering of accented characters, depending on what characters go after the accented character.
OK, I think I've fixed the problem.
Before the filter, I do a sort based on the first n letters of each string, where n is the length of the search string.
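The fix described above could look roughly like the following sketch. PrefixFilter, ComparePrefix and StartingWith are hypothetical names, not the custom function mentioned in the question. The sketch assumes that the sort and the binary search must both use the same culture-aware, case-insensitive comparison, restricted to the first n characters.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper, for illustration only.
static class PrefixFilter
{
    // Compare at most the first n characters of each string, ignoring case,
    // using the invariant culture's linguistic rules.
    static int ComparePrefix(string a, string b, int n) =>
        string.Compare(a, 0, b, 0, n, true, CultureInfo.InvariantCulture);

    public static List<string> StartingWith(List<string> items, string prefix)
    {
        int n = prefix.Length;

        // Sort a copy by the first n characters only, so the order the
        // binary search relies on matches the comparison it performs.
        var sorted = new List<string>(items);
        sorted.Sort((x, y) => ComparePrefix(x, y, n));

        // Binary search for the first item whose prefix is not less than the search prefix.
        int lo = 0, hi = sorted.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (ComparePrefix(sorted[mid], prefix, n) < 0) lo = mid + 1;
            else hi = mid;
        }

        // Collect the contiguous run of items whose prefix compares equal.
        var result = new List<string>();
        for (int i = lo; i < sorted.Count && ComparePrefix(sorted[i], prefix, n) == 0; i++)
            result.Add(sorted[i]);
        return result;
    }
}
```

With a list containing 'Muntelier, Schweiz' and 'München, Deutschland', filtering by 'mün' should return only the München entry, because the comparison still distinguishes ü from u when the prefixes are otherwise equal.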
Comments (3)
There is a tie-breaking algorithm at work, see http://unicode.org/reports/tr10/
So, "Munt..." and "Münc..." are alphabetically different and sort based on the "t" and "c".
Whereas "mun" and "mün" are alphabetically the same ("u" is equivalent to "ü" in most languages), so the tie is broken by the accent difference (a secondary-level weight in TR10 terms).
It looks like the accented character is only being used in a sort of "tie-break" situation - in other words, if the strings are otherwise equal.
Here's some sample code to demonstrate:
(I've tried adding a space after the "n" as well, to see if it was done on word boundaries - it isn't.)
Results:
I suspect this is correct by various complicated Unicode rules - but I don't know enough about them.
As for whether you need to take this into account... I wouldn't expect so. What are you doing that is thrown by this?
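The tie-break behaviour described in this answer can be sketched as follows. This is not the answer's original listing; TieBreakDemo and Cmp are hypothetical names, and the shortened strings "munt" and "münc" are assumptions chosen to mirror the question's examples.

```csharp
using System;
using System.Globalization;

// Hypothetical demo class, for illustration only.
static class TieBreakDemo
{
    // Case-insensitive, invariant-culture comparison normalised to -1, 0 or 1.
    public static int Cmp(string a, string b) =>
        Math.Sign(string.Compare(a, b, true, CultureInfo.InvariantCulture));

    public static void Demo()
    {
        // Letters are otherwise equal, so the accent decides (the question reports -1).
        Console.WriteLine(Cmp("mun", "mün"));

        // 't' versus 'c' differs at the letter level, so it decides before
        // the accent is considered (the question reports 1 for the full names).
        Console.WriteLine(Cmp("munt", "münc"));

        // Same comparison with a space after the "n", to check word boundaries.
        Console.WriteLine(Cmp("mun t", "mün c"));
    }
}
```

The exact results depend on the collation data the runtime uses, so treat the values in the comments as those reported in the question rather than guaranteed output.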
As I understand it, this is still somewhat consistent. When comparing using CultureInfo.InvariantCulture, the umlaut character ü is treated like the non-accented character u. As the strings in your first example obviously are not equal, the result will not be 0 but -1 (which seems to be a default value). In the second example Muntelier goes last because t follows c in the alphabet.
I couldn't find any clear documentation in MSDN explaining these rules, but I found that
and
gives the desired result.
Anyway, I think you'd be better off to base your sorting on a specific culture such as the current user's culture (if possible).
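As a rough sketch of the two ideas in this answer (treating ü like u, and sorting with a specific culture), something along these lines could be used. CultureSortDemo, CompareIgnoringAccents and SortFor are hypothetical names; the use of CompareOptions.IgnoreNonSpace is an assumption about one way to ignore accents, not the calls the answer originally showed.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper, for illustration only.
static class CultureSortDemo
{
    // Compare two strings while ignoring case and non-spacing marks
    // (such as the diaeresis on ü) under the given culture's rules.
    public static int CompareIgnoringAccents(string a, string b, CultureInfo culture) =>
        culture.CompareInfo.Compare(
            a, b, CompareOptions.IgnoreCase | CompareOptions.IgnoreNonSpace);

    // Return a copy of the items sorted case-insensitively for the given culture.
    public static List<string> SortFor(IEnumerable<string> items, CultureInfo culture)
    {
        var sorted = new List<string>(items);
        sorted.Sort(StringComparer.Create(culture, ignoreCase: true));
        return sorted;
    }
}
```

For example, CompareIgnoringAccents("mun", "mün", CultureInfo.InvariantCulture) should report the two strings as equal, and SortFor(list, CultureInfo.CurrentCulture) sorts by the current user's culture.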