如何计算字符串列表的良好哈希码?

发布于 2024-08-30 19:25:59 字数 937 浏览 8 评论 0原文

背景:

  • 我有一个简短的字符串列表。
  • 字符串的数量并不总是相同,但几乎总是“少数”的数量级。
  • 我们的数据库将这些字符串存储在第二个规范化表中。
  • 这些字符串一旦写入,就永远改变。到数据库。

我们希望能够在查询中快速匹配这些字符串,而不会因为进行大量连接而影响性能。

因此,我正在考虑将所有这些字符串的哈希码存储在主表中并将其包含在我们的索引中,因此只有当哈希码匹配时数据库才会处理连接。

那么如何获得一个好的哈希码呢?我可以:

  • 将所有字符串的哈希码一起异
  • 或 异或将每个字符串后的结果相乘(例如乘以 31)
  • 将所有字符串放在一起,然后获取哈希码
  • 其他方式

那么人们怎么想?


最后,我只是连接字符串并计算连接的哈希码,因为它很简单并且工作得很好。

(如果您关心的话,我们正在使用 .NET 和 SqlServer)


错误!错误!

引用 Eric Lippert 的 GetHashCode 指南和规则

文档 System.String.GetHashCode 注释 特别是两个相同的 字符串可以有不同的哈希码 在不同版本的 CLR 中,以及 事实上他们确实如此。不存储字符串 数据库中的哈希值并期望它们 永远一样,因为他们 不会的。

因此 String.GetHashcode() 不应该用于此目的。

Background:

  • I have a short list of strings.
  • The number of strings is not always the same, but are nearly always of the order of a “handful”
  • In our database will store these strings in a 2nd normalised table
  • These strings are never changed once they are written to the database.

We wish to be able to match on these strings quickly in a query without the performance hit of doing lots of joins.

So I am thinking of storing a hash code of all these strings in the main table and including it in our index, so the joins are only processed by the database when the hash code matches.

So how do I get a good hashcode? I could:

  • Xor the hash codes of all the string together
  • Xor with multiply the result after each string (say by 31)
  • Cat all the string together then get the hashcode
  • Some other way

So what do people think?


In the end I just concatenate the strings and compute the hashcode for the concatenation, as it is simple and worked well enough.

(If you care we are using .NET and SqlServer)


Bug!, Bug!

Quoting from Guidelines and rules for GetHashCode by Eric Lippert

The documentation for
System.String.GetHashCode notes
specifically that two identical
strings can have different hash codes
in different versions of the CLR, and
in fact they do. Don't store string
hashes in databases and expect them to
be the same forever, because they
won't be.

So String.GetHashcode() should not be used for this.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(11

音盲 2024-09-06 19:25:59

标准的java实践,就是简单的写

final int prime = 31;
int result = 1;
for( String s : strings )
{
    result = result * prime + s.hashCode();
}
// result is the hashcode.

Standard java practise, is to simply write

final int prime = 31;
int result = 1;
for( String s : strings )
{
    result = result * prime + s.hashCode();
}
// result is the hashcode.
浮光之海 2024-09-06 19:25:59

我认为没有理由不连接字符串并计算连接的哈希码。

打个比方,假设我想计算一个内存块的 MD5 校验和,我不会将该块分割成更小的块并为它们计算单独的 MD5 校验和,然后用一些临时方法将它们组合起来。

I see no reason not to concatenate the strings and compute the hashcode for the concatenation.

As an analogy, say that I wanted to compute a MD5 checksum for a memory block, I wouldn't split the block up into smaller pieces and compute individual MD5 checksums for them and then combine them with some ad hoc method.

水水月牙 2024-09-06 19:25:59

您的第一个选项唯一的不便是 (String1, String2) 生成与 (String2, String1) 相同的哈希码。如果这不是问题(例如,因为您有修复订单),那就没问题。

将所有字符串放在一起,然后获取哈希码”对我来说似乎更自然、更安全。

更新:正如评论所指出的,这有一个缺点,即列表 ("x", "yz") 和 ("xy","z") 会给出相同的哈希值。为了避免这种情况,您可以使用不能出现在字符串内部的字符串分隔符来连接字符串。

如果字符串很大,您可能更愿意对每个字符串进行散列,对散列码进行分类并重新散列结果。更多的CPU,更少的内存。

Your first option has the only inconvenience of (String1, String2) producing the same hashcode of (String2, String1). If that's not a problem (eg. because you have a fix order) it's fine.

"Cat all the string together then get the hashcode" seems the more natural and secure to me.

Update: As a comment points out, this has the drawback that the list ("x", "yz") and ("xy","z") would give the same hash. To avoid this, you could join the strings with a string delimiter that cannot appear inside the strings.

If the strings are big, you might prefer to hash each one, cat the hashcodes and rehash the result. More CPU, less memory.

删除会话 2024-09-06 19:25:59

我脑海中浮现的另一种方式是,根据索引使用旋转散列进行链式异或:

int shift = 0;
int result = 1;
for(String s : strings)
{
    result ^= (s.hashCode() << shift) | (s.hashCode() >> (32-shift)) & (1 << shift - 1);
    shift = (shift+1)%32;
}

编辑:阅读有效java中给出的解释,我认为杰夫的代码会更有效。

Another way that pops in my head, chain xors with rotated hashes based on index:

int shift = 0;
int result = 1;
for(String s : strings)
{
    result ^= (s.hashCode() << shift) | (s.hashCode() >> (32-shift)) & (1 << shift - 1);
    shift = (shift+1)%32;
}

edit: reading the explanation given in effective java, I think geoff's code would be much more efficient.

谈情不如逗狗 2024-09-06 19:25:59

基于 SQL 的解决方案可以基于 checksum 和 checksum_agg 函数。如果我没理解错的话,你会得到类似的结果:

MyTable
  MyTableId
  HashCode

MyChildTable
  MyTableId  (foreign key into MyTable)
  String

给定项目 (MyTableId) 的各种字符串存储在 MyChildTable 中。要计算和存储反映这些(永远不会更改)字符串的校验和,应该像这样工作:

UPDATE MyTable
 set HashCode = checksum_agg(checksum(string))
 from MyTable mt
  inner join MyChildTable ct
   on ct.MyTableId = mt.MyTableId
 where mt.MyTableId = @OnlyForThisOne

我相信这是与顺序无关的,因此字符串“The Quick Brown”将产生与“Brown The Quick”相同的校验和”。

A SQL-based solution could be based on the checksum and checksum_agg functions. If I'm following it right, you have something like:

MyTable
  MyTableId
  HashCode

MyChildTable
  MyTableId  (foreign key into MyTable)
  String

with the various strings for a given item (MyTableId) stored in MyChildTable. To calculate and store a checksum reflecting these (never-to-be-changed) strings, something like this should work:

UPDATE MyTable
 set HashCode = checksum_agg(checksum(string))
 from MyTable mt
  inner join MyChildTable ct
   on ct.MyTableId = mt.MyTableId
 where mt.MyTableId = @OnlyForThisOne

I believe this is order-independant, so strings "The quick brown" would produce the same checksum as "brown The quick".

梦亿 2024-09-06 19:25:59

我希望这是不必要的,但是由于您没有提到任何听起来像是您只是使用哈希码进行第一次检查,然后再验证字符串实际上是否相等的内容,所以我觉得有必要警告您:

哈希码相等!=值相等

会有很多组字符串产生相同的哈希码,但并不总是相等。

I hope this is unnecessary, but since you don't mention anything which sounds like you're only using the hashcodes for a first check and then later verifying that the strings are actually equal, I feel the need to warn you:

Hashcode equality != value equality

There will be lots of sets of strings which yield the identical hashcode, but won't always be equal.

╭⌒浅淡时光〆 2024-09-06 19:25:59

所以我明白,您实际上有一些需要通过哈希码来识别的字符串集,并且您需要在其中识别的那组字符串永远不会改变?

如果是这种情况,这并不特别重要,只要您使用的方案为不同的字符串/字符串组合提供唯一的数字即可。我首先连接字符串并计算 String.hashCode() 并查看最终是否得到唯一的数字。如果不这样做,那么您可以尝试:

  • 不要连接字符串,而是连接组成字符串的哈希码,并尝试不同的乘数(例如,如果您想识别两个字符串序列的组合,请尝试 HC1 + 17 * HC2,如果如果没有给出唯一的数字,请尝试 HC1 + 31 * HC2,然后尝试 19,然后尝试 37 等等——本质上任何小的奇数都可以。
  • 如果您没有以这种方式获得唯一的数字 - 或者如果您需要应对扩展的可能性集 - 那么请考虑更强的哈希码。 64 位哈希码是易于比较和哈希唯一性之间的良好折衷方案。

64 位哈希码的可能方案如下:

  • 使用相当强大的方案生成 256 个 64 位随机数的数组(您可以使用 SecureRandom,尽管 XORShift 方案可以正常工作)
  • 选择“m”,另一个“随机”64 位奇数,其位或多或少一半设置
  • 生成哈希码,遍历每个字节值 b,组成字符串,并从随机数数组中取出第 b 个数字;然后将其与当前哈希值乘以“m”进行异或或相加。

因此,基于数值配方中建议的值的实现将是:

  private static final long[] byteTable;
  private static final long HSTART = 0xBB40E64DA205B064L;
  private static final long HMULT = 7664345821815920749L;

  static {
    byteTable = new long[256];
    long h = 0x544B2FBACAAF1684L;
    for (int i = 0; i < 256; i++) {
      for (int j = 0; j < 31; j++) {
        h = (h >>> 7) ^ h;
        h = (h << 11) ^ h;
        h = (h >>> 10) ^ h;
      }
      byteTable[i] = h;
    }
  }

上面正在初始化我们的随机数数组。我们使用 XORShift 生成器,但我们实际上可以使用任何质量相当好的随机数生成器(使用特定种子创建 SecureRandom(),然后调用 nextLong() 就可以了)。然后,生成哈希码:

  public static long hashCode(String cs) {
    if (cs == null) return 1L;
    long h = HSTART;
    final long hmult = HMULT;
    final long[] ht = byteTable;
    for (int i = cs.length()-1; i >= 0; i--) {
      char ch = cs.charAt(i);
      h = (h * hmult) ^ ht[ch & 0xff];
      h = (h * hmult) ^ ht[(ch >>> 8) & 0xff];
    }
    return h;
  }

要考虑的指南是,给定 n 位哈希码,平均而言,您需要生成 2^(n/2) 个字符串的哈希值,然后才能获得碰撞。或者换句话说,对于 64 位哈希,您预计在大约 40 亿个字符串后会发生冲突(因此,如果您要处理多达几百万个字符串,则发生冲突的可能性可以忽略不计) )。

另一种选择是 MD5,它是一个非常强大的哈希值(实际上安全),但它是 128 位哈希值,因此您有一个必须处理 128 位值的轻微缺点。我想说 MD5 对于这些目的来说有点过分了——正如我所说,使用 64 位散列,您可以相当安全地处理几百万个字符串。

(抱歉,我应该澄清一下——MD5 被设计为一种安全散列,只是后来发现它不安全。“安全”散列是指给定特定散列时,故意构造会导致以下情况的输入是不可行的:在某些情况下——但不像我所理解的——你可能需要这个属性,另一方面,如果你正在处理用户输入的数据——即a。恶意用户可能会故意尝试混淆您的系统。您可能还会对我过去写的以下内容感兴趣:

So I understand, you effectively have some set of strings that you need to identify by hash code, and that set of strings that you need to identify among will never change?

If that's the case, it doesn't particularly matter, so long as the scheme you use gives you unique numbers for the different strings/combinations of strings. I would start by just concatenating the strings and calculating the String.hashCode() and seeing if you end up with unique numbers. If you don't, then you could try:

  • instead of concatenating strings, concatenate hash codes of the component strings, and try different multipliers (e.g. if you want to identify combiantions of two-string sequences, try HC1 + 17 * HC2, if that doesn't give unique numbers, try HC1 + 31 * HC2, then try 19, then try 37 etc -- essentially any small-ish odd number will do fine).
  • if you don't get unique numbers in this way-- or if you need to cope with the set of possibilities expanding-- then consider a stronger hash code. A 64-bit hash code is a good compromise between ease of comparison and likelihood of hashes being unique.

A possible scheme for a 64-bit hash code is as follows:

  • generate an array of 256 64-bit random numbers using a fairly strong scheme (you could use SecureRandom, though the XORShift scheme would work fine)
  • pick "m", another "random" 64-bit, odd number with more or less half of its bits set
  • to generate a hash code, go through each byte value, b, making up the string, and take the bth number from your array of random numbers; then XOR or add that with the current hash value, multiplied by "m"

So an implementation based on values suggested in Numerical Recipes would be:

  private static final long[] byteTable;
  private static final long HSTART = 0xBB40E64DA205B064L;
  private static final long HMULT = 7664345821815920749L;

  static {
    byteTable = new long[256];
    long h = 0x544B2FBACAAF1684L;
    for (int i = 0; i < 256; i++) {
      for (int j = 0; j < 31; j++) {
        h = (h >>> 7) ^ h;
        h = (h << 11) ^ h;
        h = (h >>> 10) ^ h;
      }
      byteTable[i] = h;
    }
  }

The above is initialising our array of random numbers. We use an XORShift generator, but we could really use any fairly good-quality random number generator (creating a SecureRandom() with a particular seed then calling nextLong() would be fine). Then, to generate a hash code:

  public static long hashCode(String cs) {
    if (cs == null) return 1L;
    long h = HSTART;
    final long hmult = HMULT;
    final long[] ht = byteTable;
    for (int i = cs.length()-1; i >= 0; i--) {
      char ch = cs.charAt(i);
      h = (h * hmult) ^ ht[ch & 0xff];
      h = (h * hmult) ^ ht[(ch >>> 8) & 0xff];
    }
    return h;
  }

A guide to consider is that given a hash code of n bits, on average you'd expect to have to generate hashes of in the order of 2^(n/2) strings before you get a collision. Or put another way, with a 64-bit hash, you'd expect a collision after around 4 billion strings (so if you're dealing with up to, say, a couple of million strings, the chances of a collision are pretty negligible).

Another option would be MD5, which is a very strong hash (practically secure), but it is a 128-bit hash, so you have the slight disadvantage of having to deal with 128-bit values. I would say MD5 is overkill for these purposes-- as I say, with a 64-bit hash, you can deal fairly safely with in the order of a few million strings.

(Sorry, I should clarify -- MD5 was designed as a secure hash, it's just that it's since found not to be secure. A "secure" hash is one where given a particular hash it's not feasible to deliberately construct input that would lead to that hash. In some circumstances-- but not as I understand in yours-- you would need this property. You might need it, on the other hand, if the strings you're dealing with a user-input data-- i.e. a malicious user could deliberately try to confuse your system. You might also be interetsed in the following I've written in the past:

手心的海 2024-09-06 19:25:59

使用 GetHashCode() 并不适合组合多个值。问题是对于字符串来说,哈希码只是一个校验和。对于相似的值来说,这留下了很少的熵。例如,为(“abc”,“bbc”)添加哈希码将与(“abd”,“abc”)相同,从而导致冲突。

如果您需要绝对确定,您可以使用真正的哈希算法,例如 SHA1、MD5 等。唯一的问题是它们是块函数,很难快速比较哈希值是否相等。相反,请尝试 CRC 或 FNV1 哈希。 FNV1 32 位非常简单:

public static class Fnv1 {
    public const uint OffsetBasis32 = 2166136261;
    public const uint FnvPrime32 = 16777619;

    public static int ComputeHash32(byte[] buffer) {
        uint hash = OffsetBasis32;

        foreach (byte b in buffer) {
            hash *= FnvPrime32;
            hash ^= b;
        }

        return (int)hash;
    }
}

Using the GetHashCode() is not ideal for combining multiple values. The problem is that for strings, the hashcode is just a checksum. This leaves little entropy for similar values. e.g. adding hashcodes for ("abc", "bbc") will be the same as ("abd", "abc"), causing a collision.

In cases where you need to be absolutely sure, you'd use a real hash algorithm, like SHA1, MD5, etc. The only problem is that they are block functions, which is difficult to quickly compare hashes for equality. Instead, try a CRC or FNV1 hash. FNV1 32-bit is super simple:

public static class Fnv1 {
    public const uint OffsetBasis32 = 2166136261;
    public const uint FnvPrime32 = 16777619;

    public static int ComputeHash32(byte[] buffer) {
        uint hash = OffsetBasis32;

        foreach (byte b in buffer) {
            hash *= FnvPrime32;
            hash ^= b;
        }

        return (int)hash;
    }
}
百合的盛世恋 2024-09-06 19:25:59

如果您碰巧使用 Java,则可以创建一个字符串数组(或将集合转换为数组),然后使用 Arrays.hashCode(),如文档 此处

If you happen to use Java, you can create an array of strings (or convert a collection to an array), and then use Arrays.hashCode() as documented here.

苏别ゝ 2024-09-06 19:25:59

让我们解决您的根本问题。

不要使用哈希码。只需为每个字符串添加一个整数主键

Let's solve your root problem.

Don't use a hashcode. Just add a integer primary key for each String

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文