How do I create a HashCode in .NET (C#) for a string that is safe to store in a database?

Posted on 2024-10-19 18:00:40

To quote from Guidelines and rules for GetHashCode by Eric Lippert:

Rule: Consumers of GetHashCode cannot rely upon it being stable over time or across appdomains

Suppose you have a Customer object that has a bunch of fields like Name, Address, and so on. If you make two such objects with exactly the same data in two different processes, they do not have to return the same hash code. If you make such an object on Tuesday in one process, shut it down, and run the program again on Wednesday, the hash codes can be different.

This has bitten people in the past. The documentation for System.String.GetHashCode notes specifically that two identical strings can have different hash codes in different versions of the CLR, and in fact they do. Don't store string hashes in databases and expect them to be the same forever, because they won't be.

So what is the correct way to create a HashCode of a string that I can store in a database?

(Please tell me I am not the first person to have left this bug in software I have written!)

Comments (3)

往事随风而去 2024-10-26 18:00:40

It depends what properties you want that hash to have. For example, you could just write something like this:

public int HashString(string text)
{
    // TODO: Determine nullity policy.

    unchecked
    {
        int hash = 23;
        foreach (char c in text)
        {
            hash = hash * 31 + c;
        }
        return hash;
    }
}

So long as you document that that is how the hash is computed, that's valid. It's in no way cryptographically secure or anything like that, but you can persist it with no problems. Two strings which are absolutely equal in the ordinal sense (i.e. with no cultural equality etc applied, exactly character-by-character the same) will produce the same hash with this code.
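
As a worked illustration of the paragraph above, here is a minimal sketch of the same multiply-and-add hash with one possible null policy filled in. Treating null as an error is an assumption made for this example, not something the answer prescribes, and the class name StableHash is hypothetical.

```csharp
using System;

public static class StableHash
{
    // Same recurrence as HashString above: start at 23, then
    // hash = hash * 31 + c for each UTF-16 code unit, with overflow wrapping.
    public static int HashString(string text)
    {
        // Assumed null policy for this sketch: reject null rather than hash it.
        if (text == null) throw new ArgumentNullException(nameof(text));

        unchecked
        {
            int hash = 23;
            foreach (char c in text)
            {
                hash = hash * 31 + c;
            }
            return hash;
        }
    }
}
```

Because the recurrence is fully specified, values can be checked by hand: the empty string gives 23, "a" (code unit 97) gives 23 * 31 + 97 = 810, and "ab" gives 810 * 31 + 98 = 25208. Being able to reproduce these numbers from the written definition is what makes the hash safe to persist.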

The problems come when you rely on undocumented hashing - i.e. something which obeys GetHashCode() but is in no way guaranteed to remain the same from version to version... like string.GetHashCode().

Writing and documenting your own hash like this is a bit like saying, "This sensitive information is hashed with MD5 (or whatever)". So long as it's a well-defined hash, that's fine.

EDIT: Other answers have suggested using cryptographic hashes such as SHA-1 or MD5. I would say that until we know there's a requirement for cryptographic security rather than just stability, there's no point in going through the rigmarole of converting the string to a byte array and hashing that. Of course if the hash is meant to be used for anything security-related, an industry-standard hash is exactly what you should be reaching for. But that wasn't mentioned anywhere in the question.
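
If a cryptographic hash does turn out to be required, the conversion mentioned above could look roughly like this. The choice of SHA-256 and UTF-8 is an assumption for this sketch (the answer only names SHA-1 and MD5 as examples), and the names CryptoHash and Sha256Hex are hypothetical. Whichever encoding is chosen has to be documented as well, since it is part of the hash definition.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class CryptoHash
{
    // Encodes the string as UTF-8 bytes, hashes them with SHA-256,
    // and returns the digest as a lowercase hex string.
    public static string Sha256Hex(string text)
    {
        // Assumed null policy for this sketch: reject null.
        if (text == null) throw new ArgumentNullException(nameof(text));

        byte[] bytes = Encoding.UTF8.GetBytes(text);
        using (SHA256 sha = SHA256.Create())
        {
            byte[] digest = sha.ComputeHash(bytes);
            // BitConverter.ToString yields "BA-78-..."; strip the dashes and lowercase.
            return BitConverter.ToString(digest).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

The result is a 64-character hex string (32 bytes), so it needs a wider column than the int returned by the simple hash.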

歌入人心 2024-10-26 18:00:40

Here is a reimplementation of the current way .NET calculates its string hash code on 64-bit systems. It does not use pointers like the real GetHashCode() does, so it will be slightly slower, but that makes it more resilient to internal changes to string. It gives a more evenly distributed hash code than Jon Skeet's version, which may result in better lookup times in dictionaries.

public static class StringExtensionMethods
{
    public static int GetStableHashCode(this string str)
    {
        unchecked
        {
            int hash1 = 5381;
            int hash2 = hash1;

            for(int i = 0; i < str.Length && str[i] != '\0'; i += 2)
            {
                hash1 = ((hash1 << 5) + hash1) ^ str[i];
                if (i == str.Length - 1 || str[i+1] == '\0')
                    break;
                hash2 = ((hash2 << 5) + hash2) ^ str[i+1];
            }

            return hash1 + (hash2*1566083941);
        }
    }
}
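
One property of the loop above is worth knowing before persisting its output: it stops at the first '\0' character, so a string with an embedded null appears to hash the same as the prefix before that null. The sketch below repeats the method as a plain static method under a hypothetical class name so this can be checked in isolation; the hashing logic is copied unchanged.

```csharp
using System;

public static class StableHashCheck
{
    // Copy of GetStableHashCode from above, without the extension-method syntax.
    public static int GetStableHashCode(string str)
    {
        unchecked
        {
            int hash1 = 5381;
            int hash2 = hash1;

            // Processes two characters per iteration and stops at the first '\0'.
            for (int i = 0; i < str.Length && str[i] != '\0'; i += 2)
            {
                hash1 = ((hash1 << 5) + hash1) ^ str[i];
                if (i == str.Length - 1 || str[i + 1] == '\0')
                    break;
                hash2 = ((hash2 << 5) + hash2) ^ str[i + 1];
            }

            return hash1 + (hash2 * 1566083941);
        }
    }
}
```

With this definition, GetStableHashCode("a\0b") equals GetStableHashCode("a"), and the empty string gives 371857150 (5381 + 5381 * 1566083941 with 32-bit wraparound). If the stored strings can contain null characters, that collision needs to be taken into account.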

凡尘雨 2024-10-26 18:00:40

There is now the System.IO.Hashing package that provides stable and standardized non-cryptographic hash algorithms. While they are designed for byte sequences, it is fairly straightforward to use them safely and very efficiently through Span:

using System;
using System.Runtime.InteropServices; // MemoryMarshal

var input = "Hello world";
var inputBytes = MemoryMarshal.AsBytes(input.AsSpan());
var hash = System.IO.Hashing.XxHash32.HashToUInt32(inputBytes);
Console.WriteLine(hash); // 899079058

Note however that, due to the reinterpretation of characters as bytes, the endianness of the system affects the result, so if you move to a big-endian system, the hash above will be different. If that is an issue, you can check BitConverter.IsLittleEndian and swap the bytes if it's false.
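
One way to sidestep the endianness caveat is to build the byte sequence explicitly in a fixed byte order instead of reinterpreting memory. The sketch below always emits UTF-16 little-endian bytes, which on a little-endian machine should match what MemoryMarshal.AsBytes yields. The class and method names are hypothetical, and the extra array allocation is a cost that the zero-copy version above avoids.

```csharp
using System;
using System.Buffers.Binary;

public static class FixedOrderBytes
{
    // Writes each UTF-16 code unit as two bytes, low byte first,
    // regardless of the byte order of the machine running the code.
    public static byte[] ToUtf16LittleEndian(string text)
    {
        // Assumed null policy for this sketch: reject null.
        if (text == null) throw new ArgumentNullException(nameof(text));

        byte[] bytes = new byte[text.Length * 2];
        for (int i = 0; i < text.Length; i++)
        {
            BinaryPrimitives.WriteUInt16LittleEndian(bytes.AsSpan(i * 2, 2), text[i]);
        }
        return bytes;
    }
}
```

The returned array can be passed to XxHash32.HashToUInt32 in place of inputBytes. For "Hi" it contains the bytes 48 00 69 00 on any platform.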
