Looking for a medium-strength hash function
I have a static set of ~35000 unique ASCII text strings from 20 to 60 bytes each. I want to introduce a unique index in them. Simply numbering would be undesirable for various reasons.
Crypto-grade functions like MD5 work fine, but I feel those are overkill. This is ultimately for a mobile project, so I'm kinda greedy on both storage and CPU cycles. On the other hand, I've tried 32-bit Adler32 and got collisions.
Can anyone think of a good hash function that produces a 64-bit value?
Comments (5)
Because the set of strings that you have is fixed, you should try looking for a perfect hash function: a hash function designed for one specific set of data so that no collisions occur. There are many tools for creating hash functions like these; one of them, gperf (not to be confused with gprof), I know is freely available. I would strongly suggest this.
If you later end up needing to change the set of strings and want a lightweight, simple hash function, you may want to consider using the Rabin-Karp rolling hash function. It can be computed for a string of length n using O(n) additions, multiplications, and modulus operations, and, when its base is chosen at random, any two distinct strings collide only with small probability. Moreover, you could probably code this up in about half an hour to test whether or not it performs better than the Adler checksum.
That said, using a well-known hash function like MD5 is still probably a good idea if you aren't trying to achieve cryptographic security. Even a simple CRC32 might be sufficient in that case.
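A minimal sketch of the Rabin-Karp style hash mentioned above, together with a helper that checks a candidate hash function against the whole string set. The modulus 2^61 - 1, the fixed seed, and the names `poly_hash` and `find_collisions` are illustrative assumptions, not something from the original answer.

```python
import random

# Assumption: 2**61 - 1 (a Mersenne prime) as the modulus, so results fit in 64 bits.
MODULUS = (1 << 61) - 1


def poly_hash(data: bytes, base: int, modulus: int = MODULUS) -> int:
    """Evaluate the bytes as polynomial coefficients at `base` (Horner's rule)."""
    h = 0
    for byte in data:
        h = (h * base + byte) % modulus
    return h


def find_collisions(strings, hash_fn):
    """Return pairs of distinct strings that hash_fn maps to the same value."""
    seen = {}
    collisions = []
    for s in strings:
        key = hash_fn(s)
        if key in seen and seen[key] != s:
            collisions.append((seen[key], s))
        else:
            seen[key] = s
    return collisions


# Hypothetical fixed seed so the same base is used every time the index is built.
BASE = random.Random(12345).randrange(256, MODULUS)

sample = [b"first example string", b"second example string"]
collisions = find_collisions(sample, lambda s: poly_hash(s, BASE))
```

Since the string set is static, running `find_collisions` once over all 35000 strings tells you whether a given candidate (this one, Adler32, CRC32) is collision-free for your data; if it is not, picking another base and rerunning is cheap.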
Given the fact that the likelihood of collisions decreases so much by going from 64 bits to 128 bits, I would strongly consider going with 128-bit MD5.
So with 35000 (3.5e4) strings and a 64-bit hash, the birthday approximation n^2 / 2^65 gives a collision chance of about 3e-11, i.e. somewhere between 10^-12 and 10^-9. This might not seem very high, but when it comes to hashing, 1 in a billion is pretty easy to hit.
By increasing to 128 bits, you go down to considerably less than 1 in a (billion * billion).
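The figures above can be checked with the standard birthday approximation. This is a small sketch that assumes an ideal, uniformly distributed hash; the function name `collision_probability` is made up for illustration.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n_items * (n_items - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(exponent)


for bits in (32, 64, 128):
    print(bits, collision_probability(35000, bits))
```

For 35000 strings this comes out at roughly 0.13 for 32 bits (which fits with the Adler32 collisions reported in the question, even before Adler32's weakness on short inputs), about 3.3e-11 for 64 bits, and about 1.8e-30 for 128 bits.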
I think you could concatenate the values of two different 32-bit hash functions to get a 64-bit hash.
To get different hash functions (two here, or four for a 128-bit value) I would use a pre-processing step that alters the input to the hash function in some way that does not commute with the hash computation itself. One way would be to use a 256-byte lookup table to renumber the bytes. Another might be to multiply each byte by X mod 257, replacing anything that yields 256 = -1 mod 257 by -X mod 257, because that value won't otherwise occur. Note that (a*256 + b) mod 257 is (b - a) mod 257, since 256 = -1 mod 257.
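A rough sketch of the lookup-table variant, assuming CRC32 from Python's `zlib` as the underlying 32-bit hash and a shuffled 256-byte table as the pre-processing step. The seed, the table, and the name `hash64` are illustrative choices, not part of the suggestion above.

```python
import random
import zlib

# Hypothetical byte-renumbering table: a fixed permutation of 0..255.
# In a shipped build this table should be stored as a constant rather than
# regenerated, so the hashes stay stable across platforms and versions.
_values = list(range(256))
random.Random(2024).shuffle(_values)
PERM_TABLE = bytes(_values)


def hash64(data: bytes) -> int:
    """Concatenate CRC32 of the raw bytes with CRC32 of the renumbered bytes."""
    high = zlib.crc32(data)
    low = zlib.crc32(data.translate(PERM_TABLE))
    return (high << 32) | low
```

Whether the two halves are independent enough to behave like a true 64-bit hash is not guaranteed, so it is worth checking the result for collisions on the actual string set.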
FWIW there is a non-secure hash function with quite a good guarantee. As an example, pick a prime number and do all your calculations modulo that number, which gives you a mathematical field. Chop your data up into a sequence of numbers modulo that prime, and treat them as the coefficients of a polynomial. As well as picking the modulus for your hash function you pick a number x mod the prime, and then evaluate the polynomial at that x. In theory x is picked at random.
Two messages map to the same value if the difference of their polynomials evaluates to zero at x, which means that the chosen x is a root of that difference polynomial. A nonzero polynomial of degree N has at most N roots, so in your case - if you have quite short strings and pick a large modulus - that's not a bad guarantee. I think I saw this suggested as a quicker way to get a secure hash function if you encrypt the result of this calculation. I think it was supposed to be faster than MD5 because even though doing arithmetic modulo 128-bit primes is expensive, somebody reckoned it was cheaper than doing MD5.
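To put rough numbers on that guarantee, here is a worked bound under some assumptions that are not in the answer itself: one byte per coefficient (so N is at most 60), non-zero byte values (true for ASCII text, so distinct strings give distinct polynomials), the prime p = 2^61 - 1, and x drawn uniformly at random from the field.

```latex
% Hash of a message m = (m_1, ..., m_n), evaluated at a random point x modulo p
\[
  h_x(m) = \sum_{i=1}^{n} m_i \, x^{\,n-i} \bmod p
\]
% Two distinct messages collide only if x is a root of their nonzero difference polynomial
\[
  \Pr_x\bigl[h_x(m) = h_x(m')\bigr] \;\le\; \frac{N}{p}
  \;\le\; \frac{60}{2^{61}-1} \;\approx\; 2.6 \times 10^{-17}
\]
% Union bound over all pairs among 35000 strings
\[
  \Pr[\text{any collision}] \;\le\; \binom{35000}{2} \cdot \frac{N}{p}
  \;\approx\; 6.1 \times 10^{8} \times 2.6 \times 10^{-17}
  \;\approx\; 1.6 \times 10^{-8}
\]
```

Because the string set is static, you can also just verify the chosen x against all strings once and pick another x in the unlikely event of a collision.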
Settled on 64-bit MurmurHash64B. Extra points for the purry sounding name.