大写与小写

发布于 2024-07-08 01:41:44 字数 523 浏览 6 评论 0原文

在进行不区分大小写的比较时,将字符串转换为大写还是小写哪个更有效? 这还重要吗?

在这篇 SO 文章中建议 C# 使用 ToUpper 更高效,因为“微软优化了就这样。” 但我也读过 这个论点 转换 ToLower 与 ToUpper 取决于你的字符串包含更多,并且通常字符串包含更多小写字符,这使得 ToLower 更高效。

我特别想知道:

  • 是否有一种方法可以优化 ToUpper 或 ToLower 以使其中一个比另一个更快?
  • 在大写或小写字符串之间进行不区分大小写的比较是否更快,为什么?
  • 是否有任何编程环境(例如 C、C#、Python 等等)其中一种情况明显优于另一种情况,为什么?

When doing case-insensitive comparisons, is it more efficient to convert the string to upper case or lower case? Does it even matter?

It is suggested in this SO post that C# is more efficient with ToUpper because "Microsoft optimized it that way." But I've also read this argument that converting ToLower vs. ToUpper depends on what your strings contain more of, and that typically strings contain more lower case characters which makes ToLower more efficient.

In particular, I would like to know:

  • Is there a way to optimize ToUpper or ToLower such that one is faster than the other?
  • Is it faster to do a case-insensitive comparison between upper or lower case strings, and why?
  • Are there any programming environments (eg. C, C#, Python, whatever) where one case is clearly better than the other, and why?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(9

如梦亦如幻 2024-07-15 01:41:44

由于某些文化(尤其是土耳其)的“有趣”特征,转换为大写或小写以进行不区分大小写的比较是不正确的。 相反,请使用带有适当选项的 StringComparer

MSDN 有一些关于字符串处理的很棒的指南。 您可能还想检查您的代码是否通过了 火鸡测试

编辑:请注意尼尔关于序数不区分大小写比较的评论。 整个领域相当模糊:(

Converting to either upper case or lower case in order to do case-insensitive comparisons is incorrect due to "interesting" features of some cultures, particularly Turkey. Instead, use a StringComparer with the appropriate options.

MSDN has some great guidelines on string handling. You might also want to check that your code passes the Turkey test.

EDIT: Note Neil's comment around ordinal case-insensitive comparisons. This whole realm is pretty murky :(

梦晓ヶ微光ヅ倾城 2024-07-15 01:41:44

来自 MSDN 上的 Microsoft

在 .NET Framework 中使用字符串的最佳实践

字符串使用建议

为什么? 来自微软

将字符串标准化为大写

有一小组字符在转换为小写后无法进行往返。

无法往返的角色的例子是什么?

  • 开始:希腊语 Rho 符号 (U+03f1) ϱ
  • 大写: 大写希腊语 Rho (U+03a1) Ρ
  • 小写: 小写希腊语 Rho (U +03c1) ρ

ϱ、Pρ

.NET Fiddle

Original: ϱ
ToUpper: Ρ
ToLower: ρ

这就是为什么,如果您想要进行不区分大小写的比较,请将字符串转换为大写,而不是小写。

因此,如果您必须选择一个,请选择大写

From Microsoft on MSDN:

Best Practices for Using Strings in the .NET Framework

Recommendations for String Usage

Why? From Microsoft:

Normalize strings to uppercase

There is a small group of characters that when converted to lowercase cannot make a round trip.

What is example of such a character that cannot make a round trip?

  • Start: Greek Rho Symbol (U+03f1) ϱ
  • Uppercase: Capital Greek Rho (U+03a1) Ρ
  • Lowercase: Small Greek Rho (U+03c1) ρ

ϱ , Ρ , ρ

.NET Fiddle

Original: ϱ
ToUpper: Ρ
ToLower: ρ

That is why, if your want to do case insensitive comparisons you convert the strings to uppercase, and not lowercase.

So if you have to choose one, choose Uppercase.

手心的温暖 2024-07-15 01:41:44

根据 MSDN 传递字符串并告知会更有效忽略大小写的比较:

String.Compare(strA, strB, StringComparison.OrdinalIgnoreCase)
相当于(但比更快)调用

String.Compare(ToUpperInvariant(strA)、ToUpperInvariant(strB)、StringComparison.Ordinal)。

这些比较仍然非常快。

当然,如果您一遍又一遍地比较一个字符串,那么这可能不成立。

According to MSDN it is more efficient to pass in the strings and tell the comparison to ignore case:

String.Compare(strA, strB, StringComparison.OrdinalIgnoreCase)
is equivalent to (but faster than) calling

String.Compare(ToUpperInvariant(strA), ToUpperInvariant(strB), StringComparison.Ordinal).

These comparisons are still very fast.

Of course, if you are comparing one string over and over again then this may not hold.

末が日狂欢 2024-07-15 01:41:44

基于字符串往往有更多的小写条目,理论上 ToLower 应该更快(大量比较,但很少赋值)。

在 C 中,或者当使用每个字符串的可单独访问的元素(例如 C 字符串或 C++ 中的 STL 字符串类型)时,它实际上是字节比较 - 因此比较 UPPER没有什么不同更低。

如果您偷偷摸摸地将字符串加载到 long 数组中,那么您将获得对整个字符串的非常快速的比较,因为它一次可以比较 4 个字节。 但是,加载时间可能会使其不值得。

为什么你需要知道哪个更快? 除非您正在进行公制的比较,否则运行速度快几个周期与整体执行的速度无关,并且听起来像是过早的优化:)

Based on strings tending to have more lowercase entries, ToLower should theoretically be faster (lots of compares, but few assignments).

In C, or when using individually-accessible elements of each string (such as C strings or the STL's string type in C++), it's actually a byte comparison - so comparing UPPER is no different from lower.

If you were sneaky and loaded your strings into long arrays instead, you'd get a very fast comparison on the whole string because it could compare 4 bytes at a time. However, the load time might make it not worthwhile.

Why do you need to know which is faster? Unless you're doing a metric buttload of comparisons, one running a couple cycles faster is irrelevant to the speed of overall execution, and sounds like premature optimization :)

黒涩兲箜 2024-07-15 01:41:44

Microsoft 优化了 ToUpperInvariant(),而不是 ToUpper()。 不同之处在于,不变性对文化更加友好。 如果需要对区域性可能不同的字符串进行不区分大小写的比较,请使用 Invariant,否则不变转换的性能并不重要。

我不能说 ToUpper() 还是 ToLower() 更快。 我从来没有尝试过,因为我从来没有遇到过性能如此重要的情况。

Microsoft has optimized ToUpperInvariant(), not ToUpper(). The difference is that invariant is more culture friendly. If you need to do case-insensitive comparisons on strings that may vary in culture, use Invariant, otherwise the performance of invariant conversion shouldn't matter.

I can't say whether ToUpper() or ToLower() is faster though. I've never tried it since I've never had a situation where performance mattered that much.

孤凫 2024-07-15 01:41:44

如果您在 C# 中进行字符串比较,使用 .Equals() 比将两个字符串都转换为大写或小写要快得多。 使用 .Equals() 的另一个大优点是不会为 2 个新的大写/小写字符串分配更多内存。

If you are doing string comparison in C# it is significantly faster to use .Equals() instead of converting both strings to upper or lower case. Another big plus for using .Equals() is that more memory isn't allocated for the 2 new upper/lower case strings.

沉鱼一梦 2024-07-15 01:41:44

我想要一些关于此的实际数据,所以我提取了两个字节的完整列表
ToLowerToUpper 失败的字符。 然后我运行了下面的测试:

using System;

class Program {
   static void Main() {
      char[][] pairs = {
new[]{'\u00E5','\u212B'},new[]{'\u00C5','\u212B'},new[]{'\u0399','\u1FBE'},
new[]{'\u03B9','\u1FBE'},new[]{'\u03B2','\u03D0'},new[]{'\u03B5','\u03F5'},
new[]{'\u03B8','\u03D1'},new[]{'\u03B8','\u03F4'},new[]{'\u03D1','\u03F4'},
new[]{'\u03B9','\u1FBE'},new[]{'\u0345','\u03B9'},new[]{'\u0345','\u1FBE'},
new[]{'\u03BA','\u03F0'},new[]{'\u00B5','\u03BC'},new[]{'\u03C0','\u03D6'},
new[]{'\u03C1','\u03F1'},new[]{'\u03C2','\u03C3'},new[]{'\u03C6','\u03D5'},
new[]{'\u03C9','\u2126'},new[]{'\u0392','\u03D0'},new[]{'\u0395','\u03F5'},
new[]{'\u03D1','\u03F4'},new[]{'\u0398','\u03D1'},new[]{'\u0398','\u03F4'},
new[]{'\u0345','\u1FBE'},new[]{'\u0345','\u0399'},new[]{'\u0399','\u1FBE'},
new[]{'\u039A','\u03F0'},new[]{'\u00B5','\u039C'},new[]{'\u03A0','\u03D6'},
new[]{'\u03A1','\u03F1'},new[]{'\u03A3','\u03C2'},new[]{'\u03A6','\u03D5'},
new[]{'\u03A9','\u2126'},new[]{'\u0398','\u03F4'},new[]{'\u03B8','\u03F4'},
new[]{'\u03B8','\u03D1'},new[]{'\u0398','\u03D1'},new[]{'\u0432','\u1C80'},
new[]{'\u0434','\u1C81'},new[]{'\u043E','\u1C82'},new[]{'\u0441','\u1C83'},
new[]{'\u0442','\u1C84'},new[]{'\u0442','\u1C85'},new[]{'\u1C84','\u1C85'},
new[]{'\u044A','\u1C86'},new[]{'\u0412','\u1C80'},new[]{'\u0414','\u1C81'},
new[]{'\u041E','\u1C82'},new[]{'\u0421','\u1C83'},new[]{'\u1C84','\u1C85'},
new[]{'\u0422','\u1C84'},new[]{'\u0422','\u1C85'},new[]{'\u042A','\u1C86'},
new[]{'\u0463','\u1C87'},new[]{'\u0462','\u1C87'}
      };
      int upper = 0, lower = 0;
      foreach (char[] pair in pairs) {
         Console.Write(
            "U+{0:X4} U+{1:X4} pass: ",
            Convert.ToInt32(pair[0]),
            Convert.ToInt32(pair[1])
         );
         if (Char.ToUpper(pair[0]) == Char.ToUpper(pair[1])) {
            Console.Write("ToUpper ");
            upper++;
         } else {
            Console.Write("        ");
         }
         if (Char.ToLower(pair[0]) == Char.ToLower(pair[1])) {
            Console.Write("ToLower");
            lower++;
         }
         Console.WriteLine();
      }
      Console.WriteLine("upper pass: {0}, lower pass: {1}", upper, lower);
   }
}

结果如下。 注意我还使用 Invariant 版本进行了测试,结果是
完全一样。 有趣的是,其中一对都失败了。 但基于此
ToUpper 是最好的选择

U+00E5 U+212B pass:         ToLower
U+00C5 U+212B pass:         ToLower
U+0399 U+1FBE pass: ToUpper
U+03B9 U+1FBE pass: ToUpper
U+03B2 U+03D0 pass: ToUpper
U+03B5 U+03F5 pass: ToUpper
U+03B8 U+03D1 pass: ToUpper
U+03B8 U+03F4 pass:         ToLower
U+03D1 U+03F4 pass:
U+03B9 U+1FBE pass: ToUpper
U+0345 U+03B9 pass: ToUpper
U+0345 U+1FBE pass: ToUpper
U+03BA U+03F0 pass: ToUpper
U+00B5 U+03BC pass: ToUpper
U+03C0 U+03D6 pass: ToUpper
U+03C1 U+03F1 pass: ToUpper
U+03C2 U+03C3 pass: ToUpper
U+03C6 U+03D5 pass: ToUpper
U+03C9 U+2126 pass:         ToLower
U+0392 U+03D0 pass: ToUpper
U+0395 U+03F5 pass: ToUpper
U+03D1 U+03F4 pass:
U+0398 U+03D1 pass: ToUpper
U+0398 U+03F4 pass:         ToLower
U+0345 U+1FBE pass: ToUpper
U+0345 U+0399 pass: ToUpper
U+0399 U+1FBE pass: ToUpper
U+039A U+03F0 pass: ToUpper
U+00B5 U+039C pass: ToUpper
U+03A0 U+03D6 pass: ToUpper
U+03A1 U+03F1 pass: ToUpper
U+03A3 U+03C2 pass: ToUpper
U+03A6 U+03D5 pass: ToUpper
U+03A9 U+2126 pass:         ToLower
U+0398 U+03F4 pass:         ToLower
U+03B8 U+03F4 pass:         ToLower
U+03B8 U+03D1 pass: ToUpper
U+0398 U+03D1 pass: ToUpper
U+0432 U+1C80 pass: ToUpper
U+0434 U+1C81 pass: ToUpper
U+043E U+1C82 pass: ToUpper
U+0441 U+1C83 pass: ToUpper
U+0442 U+1C84 pass: ToUpper
U+0442 U+1C85 pass: ToUpper
U+1C84 U+1C85 pass: ToUpper
U+044A U+1C86 pass: ToUpper
U+0412 U+1C80 pass: ToUpper
U+0414 U+1C81 pass: ToUpper
U+041E U+1C82 pass: ToUpper
U+0421 U+1C83 pass: ToUpper
U+1C84 U+1C85 pass: ToUpper
U+0422 U+1C84 pass: ToUpper
U+0422 U+1C85 pass: ToUpper
U+042A U+1C86 pass: ToUpper
U+0463 U+1C87 pass: ToUpper
U+0462 U+1C87 pass: ToUpper
upper pass: 46, lower pass: 8

I wanted some actual data on this, so I pulled the full list of two byte
characters that fail with ToLower or ToUpper. I then ran this test below:

using System;

class Program {
   static void Main() {
      char[][] pairs = {
new[]{'\u00E5','\u212B'},new[]{'\u00C5','\u212B'},new[]{'\u0399','\u1FBE'},
new[]{'\u03B9','\u1FBE'},new[]{'\u03B2','\u03D0'},new[]{'\u03B5','\u03F5'},
new[]{'\u03B8','\u03D1'},new[]{'\u03B8','\u03F4'},new[]{'\u03D1','\u03F4'},
new[]{'\u03B9','\u1FBE'},new[]{'\u0345','\u03B9'},new[]{'\u0345','\u1FBE'},
new[]{'\u03BA','\u03F0'},new[]{'\u00B5','\u03BC'},new[]{'\u03C0','\u03D6'},
new[]{'\u03C1','\u03F1'},new[]{'\u03C2','\u03C3'},new[]{'\u03C6','\u03D5'},
new[]{'\u03C9','\u2126'},new[]{'\u0392','\u03D0'},new[]{'\u0395','\u03F5'},
new[]{'\u03D1','\u03F4'},new[]{'\u0398','\u03D1'},new[]{'\u0398','\u03F4'},
new[]{'\u0345','\u1FBE'},new[]{'\u0345','\u0399'},new[]{'\u0399','\u1FBE'},
new[]{'\u039A','\u03F0'},new[]{'\u00B5','\u039C'},new[]{'\u03A0','\u03D6'},
new[]{'\u03A1','\u03F1'},new[]{'\u03A3','\u03C2'},new[]{'\u03A6','\u03D5'},
new[]{'\u03A9','\u2126'},new[]{'\u0398','\u03F4'},new[]{'\u03B8','\u03F4'},
new[]{'\u03B8','\u03D1'},new[]{'\u0398','\u03D1'},new[]{'\u0432','\u1C80'},
new[]{'\u0434','\u1C81'},new[]{'\u043E','\u1C82'},new[]{'\u0441','\u1C83'},
new[]{'\u0442','\u1C84'},new[]{'\u0442','\u1C85'},new[]{'\u1C84','\u1C85'},
new[]{'\u044A','\u1C86'},new[]{'\u0412','\u1C80'},new[]{'\u0414','\u1C81'},
new[]{'\u041E','\u1C82'},new[]{'\u0421','\u1C83'},new[]{'\u1C84','\u1C85'},
new[]{'\u0422','\u1C84'},new[]{'\u0422','\u1C85'},new[]{'\u042A','\u1C86'},
new[]{'\u0463','\u1C87'},new[]{'\u0462','\u1C87'}
      };
      int upper = 0, lower = 0;
      foreach (char[] pair in pairs) {
         Console.Write(
            "U+{0:X4} U+{1:X4} pass: ",
            Convert.ToInt32(pair[0]),
            Convert.ToInt32(pair[1])
         );
         if (Char.ToUpper(pair[0]) == Char.ToUpper(pair[1])) {
            Console.Write("ToUpper ");
            upper++;
         } else {
            Console.Write("        ");
         }
         if (Char.ToLower(pair[0]) == Char.ToLower(pair[1])) {
            Console.Write("ToLower");
            lower++;
         }
         Console.WriteLine();
      }
      Console.WriteLine("upper pass: {0}, lower pass: {1}", upper, lower);
   }
}

Result below. Note I also tested with the Invariant versions, and result was
exact same. Interestingly, one of the pairs fails with both. But based on this
ToUpper is the best option.

U+00E5 U+212B pass:         ToLower
U+00C5 U+212B pass:         ToLower
U+0399 U+1FBE pass: ToUpper
U+03B9 U+1FBE pass: ToUpper
U+03B2 U+03D0 pass: ToUpper
U+03B5 U+03F5 pass: ToUpper
U+03B8 U+03D1 pass: ToUpper
U+03B8 U+03F4 pass:         ToLower
U+03D1 U+03F4 pass:
U+03B9 U+1FBE pass: ToUpper
U+0345 U+03B9 pass: ToUpper
U+0345 U+1FBE pass: ToUpper
U+03BA U+03F0 pass: ToUpper
U+00B5 U+03BC pass: ToUpper
U+03C0 U+03D6 pass: ToUpper
U+03C1 U+03F1 pass: ToUpper
U+03C2 U+03C3 pass: ToUpper
U+03C6 U+03D5 pass: ToUpper
U+03C9 U+2126 pass:         ToLower
U+0392 U+03D0 pass: ToUpper
U+0395 U+03F5 pass: ToUpper
U+03D1 U+03F4 pass:
U+0398 U+03D1 pass: ToUpper
U+0398 U+03F4 pass:         ToLower
U+0345 U+1FBE pass: ToUpper
U+0345 U+0399 pass: ToUpper
U+0399 U+1FBE pass: ToUpper
U+039A U+03F0 pass: ToUpper
U+00B5 U+039C pass: ToUpper
U+03A0 U+03D6 pass: ToUpper
U+03A1 U+03F1 pass: ToUpper
U+03A3 U+03C2 pass: ToUpper
U+03A6 U+03D5 pass: ToUpper
U+03A9 U+2126 pass:         ToLower
U+0398 U+03F4 pass:         ToLower
U+03B8 U+03F4 pass:         ToLower
U+03B8 U+03D1 pass: ToUpper
U+0398 U+03D1 pass: ToUpper
U+0432 U+1C80 pass: ToUpper
U+0434 U+1C81 pass: ToUpper
U+043E U+1C82 pass: ToUpper
U+0441 U+1C83 pass: ToUpper
U+0442 U+1C84 pass: ToUpper
U+0442 U+1C85 pass: ToUpper
U+1C84 U+1C85 pass: ToUpper
U+044A U+1C86 pass: ToUpper
U+0412 U+1C80 pass: ToUpper
U+0414 U+1C81 pass: ToUpper
U+041E U+1C82 pass: ToUpper
U+0421 U+1C83 pass: ToUpper
U+1C84 U+1C85 pass: ToUpper
U+0422 U+1C84 pass: ToUpper
U+0422 U+1C85 pass: ToUpper
U+042A U+1C86 pass: ToUpper
U+0463 U+1C87 pass: ToUpper
U+0462 U+1C87 pass: ToUpper
upper pass: 46, lower pass: 8
猫腻 2024-07-15 01:41:44

这真的不应该重要。 对于 ASCII 字符,这绝对不重要 - 只需进行一些比较和任一方向的一点翻转即可。 Unicode 可能有点复杂,因为有些字符会以奇怪的方式更改大小写,但除非您的文本充满了这些特殊字符,否则实际上不应该有任何区别。

It really shouldn't ever matter. With ASCII characters, it definitely doesn't matter - it's just a few comparisons and a bit flip for either direction. Unicode might be a little more complicated, since there are some characters that change case in weird ways, but there really shouldn't be any difference unless your text is full of those special characters.

小嗲 2024-07-15 01:41:44

如果做得正确,如果转换为小写字母,应该会有一个小的、微不足道的速度优势,但正如许多人所暗示的那样,这是文化相关的,并且不是在函数中继承,而是在您转换的字符串中继承(大量小写字母)意味着很少的内存分配)——如果您有一个包含大量大写字母的字符串,则转换为大写字母会更快。

Doing it right, there should be a small, insignificant speed advantage if you convert to lower case, but this is, as many has hinted, culture dependent and is not inherit in the function but in the strings you convert (lots of lower case letters means few assignments to memory) -- converting to upper case is faster if you have a string with lots of upper case letters.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文