Normalizing strings with String.ToUpperInvariant()

Posted 2024-07-17 18:36:09

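I currently store normalized versions of strings in lowercase in my SQL Server database. For example, my Users table has both a UserName field and a LoweredUserName field. Depending on the context, I use either T-SQL's LOWER() function or C#'s String.ToLower() method to generate the lowercase version of the user name that fills the LoweredUserName field. According to Microsoft's guidelines and Visual Studio's code analysis rule CA1308, I should be using C#'s String.ToUpperInvariant() instead of ToLower(). According to Microsoft, this is both a performance issue and a globalization issue: converting to uppercase is safe, while converting to lowercase can lose information (for example, the Turkish "I" problem).

If I move to ToUpperInvariant for string normalization, I will also have to change my database schema, because my schema is based on Microsoft's ASP.NET Membership framework (see this related question), which normalizes strings to lowercase.

Isn't Microsoft contradicting itself by telling us to normalize to uppercase in C# while the code in its own Membership tables and procedures normalizes to lowercase? Should I switch everything to uppercase normalization, or keep using lowercase normalization?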

看春风乍起 2024-07-24 18:36:09

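According to CA1308, the reason for the rule is that some characters cannot make a round trip when they are converted to lowercase. What matters is that you always move in one direction, so if your standard is to always convert to lowercase, there is no reason to change it.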
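
A minimal sketch of "always move in one direction", assuming a hypothetical helper named UserNameNormalizer (it is not part of the ASP.NET Membership framework): every value that is written to, or compared against, LoweredUserName goes through the same culture-independent call.

```csharp
using System;

public static class UserNameNormalizer
{
    // Hypothetical helper: the single place where user names are normalized.
    // ToLowerInvariant() ignores the current thread culture, so the result
    // does not depend on which culture the calling code happens to run under.
    public static string Normalize(string userName)
    {
        if (userName == null) throw new ArgumentNullException(nameof(userName));
        return userName.ToLowerInvariant();
    }
}

public class Program
{
    public static void Main()
    {
        // Both the stored value and the lookup value take the same path.
        string stored = UserNameNormalizer.Normalize("JSmith");
        string lookup = UserNameNormalizer.Normalize("jsmith");
        Console.WriteLine(stored == lookup); // True
    }
}
```

The direction chosen matters less than using the same call everywhere. Values that were lowercased by T-SQL's LOWER() may not agree with this helper for every non-ASCII name, which the sketch does not check.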

我不咬妳我踢妳 2024-07-24 18:36:09

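To answer your first question: yes, Microsoft is a bit inconsistent. To answer your second question: don't switch anything until you have confirmed that this is causing a bottleneck in your application.

Think of how much progress you could make on your project instead of spending time switching everything over. Your development time is worth more than the savings you would get from a change like this.

Remember:

Premature optimization is the root of all evil (or at least most of it) in programming. - Donald Knuth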

陪你搞怪i 2024-07-24 18:36:09

For anyone else who lands here wondering why "uppercase" is recommended, here is a demonstration of the so-called "Turkish I Problem":

using System;
using System.Globalization;
using System.Threading;

public class Program
{
    public static void Main()
    {
        // Make the culture-sensitive ToLower()/ToUpper() calls below use Turkish casing rules.
        Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");
        var input = "\u0131"; // LATIN SMALL LETTER DOTLESS I
        Console.WriteLine("Lower then upper equals original: " + (input.ToLower().ToUpper() == input));
        Console.WriteLine("Upper then lower equals original: " + (input.ToUpper().ToLower() == input));
    }
}

The above code produces the following output:

Lower then upper equals original: False
Upper then lower equals original: True

This shows that under the Turkish culture (i.e. tr-TR), if you normalize strings by converting them all to uppercase, you get the original string back when you later convert those uppercase strings to lowercase. If you normalize to lowercase instead, you cannot get back to the original string (i.e. you can't round-trip).

I am not sure how all this plays out with other languages (Unicode is messy business) but at least for Turkish one can see why uppercase is recommended over lowercase.
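
The demo above starts from a lowercase letter, so an ordinary letter such as "a" would give the same two results. A sketch that isolates the culture-specific part compares the culture-sensitive methods with their invariant counterparts. It assumes Turkish culture data is available to the runtime; in globalization-invariant mode, creating the tr-TR culture may fail or fall back to invariant behavior.

```csharp
using System;
using System.Globalization;

public class Program
{
    // Formats the first character of a string as a Unicode code point, e.g. U+0069.
    private static string CodePoint(string s)
    {
        return "U+" + ((int)s[0]).ToString("X4");
    }

    public static void Main()
    {
        // Assumption: the runtime has culture data for tr-TR.
        var turkish = new CultureInfo("tr-TR");

        // Turkish rules lowercase "I" to dotless i (U+0131);
        // the invariant method maps it to plain "i" (U+0069).
        Console.WriteLine("I lower, tr-TR:     " + CodePoint("I".ToLower(turkish)));
        Console.WriteLine("I lower, invariant: " + CodePoint("I".ToLowerInvariant()));

        // Turkish rules uppercase "i" to dotted capital I (U+0130);
        // the invariant method maps it to plain "I" (U+0049).
        Console.WriteLine("i upper, tr-TR:     " + CodePoint("i".ToUpper(turkish)));
        Console.WriteLine("i upper, invariant: " + CodePoint("i".ToUpperInvariant()));
    }
}
```

Whichever direction is chosen, the invariant methods keep the normalized value independent of the culture the code runs under, which is the part that matters for a stored key such as LoweredUserName.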
