当前位置：文江博客话题详情

如何计算字符串列表的良好哈希码？

发布于 2024-08-30 19:25:59 字数 937 浏览 8 评论 0原文

背景：

我有一个简短的字符串列表。
字符串的数量并不总是相同，但几乎总是“少数”的数量级。
我们的数据库将这些字符串存储在第二个规范化表中。
这些字符串一旦写入，就永远改变。到数据库。

我们希望能够在查询中快速匹配这些字符串，而不会因为进行大量连接而影响性能。

因此，我正在考虑将所有这些字符串的哈希码存储在主表中并将其包含在我们的索引中，因此只有当哈希码匹配时数据库才会处理连接。

那么如何获得一个好的哈希码呢？我可以：

将所有字符串的哈希码一起异
或异或将每个字符串后的结果相乘（例如乘以 31）
将所有字符串放在一起，然后获取哈希码
其他方式

那么人们怎么想？

最后，我只是连接字符串并计算连接的哈希码，因为它很简单并且工作得很好。

（如果您关心的话，我们正在使用 .NET 和 SqlServer）

错误！错误！

引用 Eric Lippert 的 GetHashCode 指南和规则

文档 System.String.GetHashCode 注释特别是两个相同的字符串可以有不同的哈希码在不同版本的 CLR 中，以及事实上他们确实如此。不存储字符串数据库中的哈希值并期望它们永远一样，因为他们不会的。

因此 String.GetHashcode() 不应该用于此目的。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

音盲 2024-09-06 19:25:59

标准的java实践，就是简单的写

final int prime = 31;
int result = 1;
for( String s : strings )
{
    result = result * prime + s.hashCode();
}
// result is the hashcode.

Standard java practise, is to simply write

final int prime = 31;
int result = 1;
for( String s : strings )
{
    result = result * prime + s.hashCode();
}
// result is the hashcode.

回复收藏 0 原文

浮光之海 2024-09-06 19:25:59

我认为没有理由不连接字符串并计算连接的哈希码。

打个比方，假设我想计算一个内存块的 MD5 校验和，我不会将该块分割成更小的块并为它们计算单独的 MD5 校验和，然后用一些临时方法将它们组合起来。

回复收藏 0 原文

水水月牙 2024-09-06 19:25:59

您的第一个选项唯一的不便是 (String1, String2) 生成与 (String2, String1) 相同的哈希码。如果这不是问题（例如，因为您有修复订单），那就没问题。

“将所有字符串放在一起，然后获取哈希码”对我来说似乎更自然、更安全。

更新：正如评论所指出的，这有一个缺点，即列表 ("x", "yz") 和 ("xy","z") 会给出相同的哈希值。为了避免这种情况，您可以使用不能出现在字符串内部的字符串分隔符来连接字符串。

如果字符串很大，您可能更愿意对每个字符串进行散列，对散列码进行分类并重新散列结果。更多的CPU，更少的内存。

回复收藏 0 原文

删除会话 2024-09-06 19:25:59

我脑海中浮现的另一种方式是，根据索引使用旋转散列进行链式异或：

int shift = 0;
int result = 1;
for(String s : strings)
{
    result ^= (s.hashCode() << shift) | (s.hashCode() >> (32-shift)) & (1 << shift - 1);
    shift = (shift+1)%32;
}

编辑：阅读有效java中给出的解释，我认为杰夫的代码会更有效。

Another way that pops in my head, chain xors with rotated hashes based on index:

int shift = 0;
int result = 1;
for(String s : strings)
{
    result ^= (s.hashCode() << shift) | (s.hashCode() >> (32-shift)) & (1 << shift - 1);
    shift = (shift+1)%32;
}

edit: reading the explanation given in effective java, I think geoff's code would be much more efficient.

回复收藏 0 原文

谈情不如逗狗 2024-09-06 19:25:59

基于 SQL 的解决方案可以基于 checksum 和 checksum_agg 函数。如果我没理解错的话，你会得到类似的结果：

MyTable
  MyTableId
  HashCode

MyChildTable
  MyTableId  (foreign key into MyTable)
  String

给定项目 (MyTableId) 的各种字符串存储在 MyChildTable 中。要计算和存储反映这些（永远不会更改）字符串的校验和，应该像这样工作：

UPDATE MyTable
 set HashCode = checksum_agg(checksum(string))
 from MyTable mt
  inner join MyChildTable ct
   on ct.MyTableId = mt.MyTableId
 where mt.MyTableId = @OnlyForThisOne

我相信这是与顺序无关的，因此字符串“The Quick Brown”将产生与“Brown The Quick”相同的校验和”。

A SQL-based solution could be based on the checksum and checksum_agg functions. If I'm following it right, you have something like:

MyTable
  MyTableId
  HashCode

MyChildTable
  MyTableId  (foreign key into MyTable)
  String

with the various strings for a given item (MyTableId) stored in MyChildTable. To calculate and store a checksum reflecting these (never-to-be-changed) strings, something like this should work:

UPDATE MyTable
 set HashCode = checksum_agg(checksum(string))
 from MyTable mt
  inner join MyChildTable ct
   on ct.MyTableId = mt.MyTableId
 where mt.MyTableId = @OnlyForThisOne

I believe this is order-independant, so strings "The quick brown" would produce the same checksum as "brown The quick".

回复收藏 0 原文

梦亿 2024-09-06 19:25:59

我希望这是不必要的，但是由于您没有提到任何听起来像是您只是使用哈希码进行第一次检查，然后再验证字符串实际上是否相等的内容，所以我觉得有必要警告您：

哈希码相等！=值相等

会有很多组字符串产生相同的哈希码，但并不总是相等。

回复收藏 0 原文

╭⌒浅淡时光〆 2024-09-06 19:25:59

所以我明白，您实际上有一些需要通过哈希码来识别的字符串集，并且您需要在其中识别的那组字符串永远不会改变？

如果是这种情况，这并不特别重要，只要您使用的方案为不同的字符串/字符串组合提供唯一的数字即可。我首先连接字符串并计算 String.hashCode() 并查看最终是否得到唯一的数字。如果不这样做，那么您可以尝试：

不要连接字符串，而是连接组成字符串的哈希码，并尝试不同的乘数（例如，如果您想识别两个字符串序列的组合，请尝试 HC1 + 17 * HC2，如果如果没有给出唯一的数字，请尝试 HC1 + 31 * HC2，然后尝试 19，然后尝试 37 等等——本质上任何小的奇数都可以。
如果您没有以这种方式获得唯一的数字 - 或者如果您需要应对扩展的可能性集 - 那么请考虑更强的哈希码。 64 位哈希码是易于比较和哈希唯一性之间的良好折衷方案。

64 位哈希码的可能方案如下：

使用相当强大的方案生成 256 个 64 位随机数的数组（您可以使用 SecureRandom，尽管 XORShift 方案可以正常工作）
选择“m”，另一个“随机”64 位奇数，其位或多或少一半设置
生成哈希码，遍历每个字节值 b，组成字符串，并从随机数数组中取出第 b 个数字；然后将其与当前哈希值乘以“m”进行异或或相加。

因此，基于数值配方中建议的值的实现将是：

  private static final long[] byteTable;
  private static final long HSTART = 0xBB40E64DA205B064L;
  private static final long HMULT = 7664345821815920749L;

  static {
    byteTable = new long[256];
    long h = 0x544B2FBACAAF1684L;
    for (int i = 0; i < 256; i++) {
      for (int j = 0; j < 31; j++) {
        h = (h >>> 7) ^ h;
        h = (h << 11) ^ h;
        h = (h >>> 10) ^ h;
      }
      byteTable[i] = h;
    }
  }

上面正在初始化我们的随机数数组。我们使用 XORShift 生成器，但我们实际上可以使用任何质量相当好的随机数生成器（使用特定种子创建 SecureRandom()，然后调用 nextLong() 就可以了）。然后，生成哈希码：

  public static long hashCode(String cs) {
    if (cs == null) return 1L;
    long h = HSTART;
    final long hmult = HMULT;
    final long[] ht = byteTable;
    for (int i = cs.length()-1; i >= 0; i--) {
      char ch = cs.charAt(i);
      h = (h * hmult) ^ ht[ch & 0xff];
      h = (h * hmult) ^ ht[(ch >>> 8) & 0xff];
    }
    return h;
  }

要考虑的指南是，给定 n 位哈希码，平均而言，您需要生成 2^(n/2) 个字符串的哈希值，然后才能获得碰撞。或者换句话说，对于 64 位哈希，您预计在大约 40 亿个字符串后会发生冲突（因此，如果您要处理多达几百万个字符串，则发生冲突的可能性可以忽略不计））。

另一种选择是 MD5，它是一个非常强大的哈希值（实际上安全），但它是 128 位哈希值，因此您有一个必须处理 128 位值的轻微缺点。我想说 MD5 对于这些目的来说有点过分了——正如我所说，使用 64 位散列，您可以相当安全地处理几百万个字符串。

（抱歉，我应该澄清一下——MD5 被设计为一种安全散列，只是后来发现它不安全。“安全”散列是指给定特定散列时，故意构造会导致以下情况的输入是不可行的：在某些情况下——但不像我所理解的——你可能需要这个属性，另一方面，如果你正在处理用户输入的数据——即a。恶意用户可能会故意尝试混淆您的系统。您可能还会对我过去写的以下内容感兴趣：

So I understand, you effectively have some set of strings that you need to identify by hash code, and that set of strings that you need to identify among will never change?

If that's the case, it doesn't particularly matter, so long as the scheme you use gives you unique numbers for the different strings/combinations of strings. I would start by just concatenating the strings and calculating the String.hashCode() and seeing if you end up with unique numbers. If you don't, then you could try:

instead of concatenating strings, concatenate hash codes of the component strings, and try different multipliers (e.g. if you want to identify combiantions of two-string sequences, try HC1 + 17 * HC2, if that doesn't give unique numbers, try HC1 + 31 * HC2, then try 19, then try 37 etc -- essentially any small-ish odd number will do fine).
if you don't get unique numbers in this way-- or if you need to cope with the set of possibilities expanding-- then consider a stronger hash code. A 64-bit hash code is a good compromise between ease of comparison and likelihood of hashes being unique.

A possible scheme for a 64-bit hash code is as follows:

generate an array of 256 64-bit random numbers using a fairly strong scheme (you could use SecureRandom, though the XORShift scheme would work fine)
pick "m", another "random" 64-bit, odd number with more or less half of its bits set
to generate a hash code, go through each byte value, b, making up the string, and take the bth number from your array of random numbers; then XOR or add that with the current hash value, multiplied by "m"

So an implementation based on values suggested in Numerical Recipes would be:

  private static final long[] byteTable;
  private static final long HSTART = 0xBB40E64DA205B064L;
  private static final long HMULT = 7664345821815920749L;

  static {
    byteTable = new long[256];
    long h = 0x544B2FBACAAF1684L;
    for (int i = 0; i < 256; i++) {
      for (int j = 0; j < 31; j++) {
        h = (h >>> 7) ^ h;
        h = (h << 11) ^ h;
        h = (h >>> 10) ^ h;
      }
      byteTable[i] = h;
    }
  }

The above is initialising our array of random numbers. We use an XORShift generator, but we could really use any fairly good-quality random number generator (creating a SecureRandom() with a particular seed then calling nextLong() would be fine). Then, to generate a hash code:

  public static long hashCode(String cs) {
    if (cs == null) return 1L;
    long h = HSTART;
    final long hmult = HMULT;
    final long[] ht = byteTable;
    for (int i = cs.length()-1; i >= 0; i--) {
      char ch = cs.charAt(i);
      h = (h * hmult) ^ ht[ch & 0xff];
      h = (h * hmult) ^ ht[(ch >>> 8) & 0xff];
    }
    return h;
  }

A guide to consider is that given a hash code of n bits, on average you'd expect to have to generate hashes of in the order of 2^(n/2) strings before you get a collision. Or put another way, with a 64-bit hash, you'd expect a collision after around 4 billion strings (so if you're dealing with up to, say, a couple of million strings, the chances of a collision are pretty negligible).

Another option would be MD5, which is a very strong hash (practically secure), but it is a 128-bit hash, so you have the slight disadvantage of having to deal with 128-bit values. I would say MD5 is overkill for these purposes-- as I say, with a 64-bit hash, you can deal fairly safely with in the order of a few million strings.

(Sorry, I should clarify -- MD5 was designed as a secure hash, it's just that it's since found not to be secure. A "secure" hash is one where given a particular hash it's not feasible to deliberately construct input that would lead to that hash. In some circumstances-- but not as I understand in yours-- you would need this property. You might need it, on the other hand, if the strings you're dealing with a user-input data-- i.e. a malicious user could deliberately try to confuse your system. You might also be interetsed in the following I've written in the past:

guide to hash codes
secure hash codes in Java (includes some performance measurements)

回复收藏 0 原文

手心的海 2024-09-06 19:25:59

使用 GetHashCode() 并不适合组合多个值。问题是对于字符串来说，哈希码只是一个校验和。对于相似的值来说，这留下了很少的熵。例如，为（“abc”，“bbc”）添加哈希码将与（“abd”，“abc”）相同，从而导致冲突。

如果您需要绝对确定，您可以使用真正的哈希算法，例如 SHA1、MD5 等。唯一的问题是它们是块函数，很难快速比较哈希值是否相等。相反，请尝试 CRC 或 FNV1 哈希。 FNV1 32 位非常简单：

public static class Fnv1 {
    public const uint OffsetBasis32 = 2166136261;
    public const uint FnvPrime32 = 16777619;

    public static int ComputeHash32(byte[] buffer) {
        uint hash = OffsetBasis32;

        foreach (byte b in buffer) {
            hash *= FnvPrime32;
            hash ^= b;
        }

        return (int)hash;
    }
}

Using the GetHashCode() is not ideal for combining multiple values. The problem is that for strings, the hashcode is just a checksum. This leaves little entropy for similar values. e.g. adding hashcodes for ("abc", "bbc") will be the same as ("abd", "abc"), causing a collision.

In cases where you need to be absolutely sure, you'd use a real hash algorithm, like SHA1, MD5, etc. The only problem is that they are block functions, which is difficult to quickly compare hashes for equality. Instead, try a CRC or FNV1 hash. FNV1 32-bit is super simple:

public static class Fnv1 {
    public const uint OffsetBasis32 = 2166136261;
    public const uint FnvPrime32 = 16777619;

    public static int ComputeHash32(byte[] buffer) {
        uint hash = OffsetBasis32;

        foreach (byte b in buffer) {
            hash *= FnvPrime32;
            hash ^= b;
        }

        return (int)hash;
    }
}

回复收藏 0 原文