Upper case vs. lower case
When doing case-insensitive comparisons, is it more efficient to convert the string to upper case or lower case? Does it even matter?
It is suggested in this SO post that C# is more efficient with ToUpper because "Microsoft optimized it that way." But I've also read this argument that converting ToLower vs. ToUpper depends on what your strings contain more of, and that typically strings contain more lower case characters which makes ToLower more efficient.
In particular, I would like to know:
- Is there a way to optimize ToUpper or ToLower such that one is faster than the other?
- Is it faster to do a case-insensitive comparison between upper or lower case strings, and why?
- Are there any programming environments (e.g. C, C#, Python, whatever) where one case is clearly better than the other, and why?
Comments (9)
Converting to either upper case or lower case in order to do case-insensitive comparisons is incorrect due to "interesting" features of some cultures, particularly Turkey. Instead, use a StringComparer with the appropriate options.
MSDN has some great guidelines on string handling. You might also want to check that your code passes the Turkey test.
EDIT: Note Neil's comment around ordinal case-insensitive comparisons. This whole realm is pretty murky :(
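For readers outside .NET, the same idea (let a caseless comparison do the work instead of picking upper or lower yourself) can be sketched in Python. This is only an analogue, not the StringComparer API: Python's str.casefold() is not culture-aware, so it does not apply Turkish rules, and the helper name equals_ignore_case is made up for illustration.

```python
def equals_ignore_case(a: str, b: str) -> bool:
    """Caseless comparison using Unicode case folding (hypothetical helper)."""
    return a.casefold() == b.casefold()

# German sharp s: lower() keeps "ß", while casefold() expands it to "ss".
print("Straße".lower() == "STRASSE".lower())    # False
print(equals_ignore_case("Straße", "STRASSE"))  # True
```

The point of the sketch is that a naive lower() comparison and a proper caseless comparison can disagree even without any Turkish text involved.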
From Microsoft on MSDN:
Why? From Microsoft:
What is an example of such a character that cannot make a round trip?
.NET Fiddle
That is why, if you want to do case-insensitive comparisons, you convert the strings to uppercase, and not lowercase.
So if you have to choose one, choose Uppercase.
According to MSDN it is more efficient to pass in the strings and tell the comparison to ignore case:
Of course, if you are comparing one string over and over again then this may not hold.
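The caveat about comparing one string over and over can be sketched as follows. This Python sketch only illustrates the trade-off and is not what MSDN describes: find_matches is a hypothetical helper, and str.casefold() stands in for an ignore-case comparison. The fixed string is normalized once, outside the loop, rather than on every comparison.

```python
def find_matches(needle: str, candidates: list) -> list:
    """Return the candidates equal to needle, ignoring case (hypothetical helper)."""
    key = needle.casefold()  # normalize the repeated string only once
    return [c for c in candidates if c.casefold() == key]

print(find_matches("admin", ["Admin", "ADMIN", "guest"]))  # ['Admin', 'ADMIN']
```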
Based on strings tending to have more lowercase entries, ToLower should theoretically be faster (lots of compares, but few assignments).
In C, or when using individually-accessible elements of each string (such as C strings or the STL's string type in C++), it's actually a byte comparison - so comparing UPPER is no different from lower.
If you were sneaky and loaded your strings into long arrays instead, you'd get a very fast comparison on the whole string because it could compare 4 bytes at a time. However, the load time might make it not worthwhile.
Why do you need to know which is faster? Unless you're doing a metric buttload of comparisons, one running a couple cycles faster is irrelevant to the speed of overall execution, and sounds like premature optimization :)
Microsoft has optimized ToUpperInvariant(), not ToUpper(). The difference is that invariant is more culture friendly. If you need to do case-insensitive comparisons on strings that may vary in culture, use Invariant, otherwise the performance of invariant conversion shouldn't matter.
I can't say whether ToUpper() or ToLower() is faster though. I've never tried it since I've never had a situation where performance mattered that much.
If you are doing string comparison in C# it is significantly faster to use .Equals() instead of converting both strings to upper or lower case. Another big plus for using .Equals() is that more memory isn't allocated for the 2 new upper/lower case strings.
I wanted some actual data on this, so I pulled the full list of two-byte characters that fail with ToLower or ToUpper. I then ran this test below:
Result below. Note I also tested with the Invariant versions, and the result was exactly the same. Interestingly, one of the pairs fails with both. But based on this, ToUpper is the best option.
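The test code and its results did not survive in this copy of the thread. As a rough stand-in, here is a Python sketch of one way such a check could look. It is an assumption about the method, not the original C# test, and the outcome depends on the runtime and its Unicode version. disagreeing_pairs is a hypothetical helper that reports pairs of characters that compare equal after one conversion but not after the other.

```python
from collections import defaultdict
from itertools import combinations

def disagreeing_pairs(convert, other, limit=0x10000):
    """Pairs of BMP characters that are equal under `convert` but not under `other`."""
    groups = defaultdict(list)
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:  # skip surrogate code points
            continue
        ch = chr(cp)
        groups[convert(ch)].append(ch)
    pairs = []
    for chars in groups.values():
        for a, b in combinations(chars, 2):
            if other(a) != other(b):
                pairs.append((a, b))
    return pairs

# Equal after upper() but different after lower(), e.g. "i" and dotless "ı".
upper_only = disagreeing_pairs(str.upper, str.lower)
# Equal after lower() but different after upper(), e.g. "K" and the Kelvin sign.
lower_only = disagreeing_pairs(str.lower, str.upper)
print(len(upper_only), len(lower_only))
```

Both lists are non-empty in Python, so this sketch on its own does not settle which direction is better. It only shows that the two normalizations can disagree.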
It really shouldn't ever matter. With ASCII characters, it definitely doesn't matter - it's just a few comparisons and a bit flip for either direction. Unicode might be a little more complicated, since there are some characters that change case in weird ways, but there really shouldn't be any difference unless your text is full of those special characters.
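For the ASCII case, the bit flip mentioned above can be sketched like this. The helper names are made up for illustration, and the code deliberately handles only the 26 ASCII letters: upper and lower case letters differ in a single bit (0x20), so either direction costs a range check plus one bit operation.

```python
CASE_BIT = 0x20  # the only bit that differs between 'A' (0x41) and 'a' (0x61)

def ascii_upper(byte: int) -> int:
    """Clear the case bit for a-z; leave every other byte unchanged."""
    return byte & ~CASE_BIT if 0x61 <= byte <= 0x7A else byte

def ascii_lower(byte: int) -> int:
    """Set the case bit for A-Z; leave every other byte unchanged."""
    return byte | CASE_BIT if 0x41 <= byte <= 0x5A else byte

def ascii_equal_ignore_case(a: bytes, b: bytes) -> bool:
    """Compare byte by byte without building converted copies."""
    return len(a) == len(b) and all(
        ascii_lower(x) == ascii_lower(y) for x, y in zip(a, b)
    )

print(bytes(map(ascii_upper, b"Hello, World!")))    # b'HELLO, WORLD!'
print(ascii_equal_ignore_case(b"Hello", b"hELLO"))  # True
```

The two conversion functions are mirror images of each other, which is why neither direction has an inherent advantage for plain ASCII.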
Doing it right, there should be a small, insignificant speed advantage if you convert to lower case, but this is, as many have hinted, culture dependent and is not inherent in the function but in the strings you convert (lots of lower case letters means few assignments to memory) -- converting to upper case is faster if you have a string with lots of upper case letters.