Hash function for four unsigned integers (C++)
I'm writing a program that produces four unsigned 32-bit integers as the output of a certain function. I want to hash these four integers so I can compare the output of this function to future outputs.
I'm having trouble writing a decent hashing function, though. When I originally wrote this code, I threw in a simple addition of the four integers, which I knew would not suffice. I've tried several other techniques, such as shifting and adding, to no avail. I get a hash, but it's of poor quality, and the function generates a ton of collisions.
The hash output can be either a 32-bit or 64-bit integer. The function in question generates many billions of hashes, so collisions are a real problem here, and I'm willing to use a larger variable to ensure that there are as few collisions as possible.
Can anyone help me figure out how to write a quality hash function?
Why don't you store the four integers in a suitable data structure and compare them all? The benefit of hashing them in this case appears dubious to me, unless storage is a problem.
If storage is the issue, you can use one of the hash functions analyzed here.
Here's a fairly reasonable hash function from 4 integers to 1 integer:
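A minimal sketch of such a function, assuming the multiply-by-37-and-add scheme that the collision example further down implies. The name `hash4` and the four-parameter signature are illustrative, not taken from the original answer:

```cpp
#include <cstdint>

// Sketch: fold four 32-bit values into one 32-bit hash by repeatedly
// multiplying the accumulator by a small prime and adding the next input.
// Unsigned arithmetic wraps modulo 2^32, which is the intended behaviour.
constexpr std::uint32_t hash4(std::uint32_t a, std::uint32_t b,
                              std::uint32_t c, std::uint32_t d) {
    std::uint32_t hash = a;
    hash = hash * 37u + b;
    hash = hash * 37u + c;
    hash = hash * 37u + d;
    return hash;
}
```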
With uniformly distributed input it gives uniformly distributed output. All bits of the input participate in the output, and every input value (although not every input bit) can affect every output bit. Chances are it's faster than the function that produces the output, in which case there are no performance concerns.
There are other hashes with other characteristics, but accumulate-with-multiplication-by-prime is a good start until proven otherwise. You could try accumulating with xor instead of addition if you like. Either way, it's easy to generate collisions (for example {1, 0, a, b} collides with {0, 37, a, b} for all a, b), so you might want to pick a prime which you think has nothing to do with any plausible implementation bug in your function. So if your function has a lot of modulo-37 arithmetic in it, maybe use 1000003 instead.
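To make the choice of prime concrete, here is a hedged sketch of the same accumulate-and-multiply idea with the multiplier passed in as a parameter and a 64-bit accumulator. The helper name `hash4_with` is hypothetical, and the 64-bit width is an assumption based on the question allowing a 64-bit result:

```cpp
#include <cstdint>

// Hypothetical helper: multiply-and-add hash with a caller-chosen prime.
// The accumulator is 64 bits wide; unsigned overflow wraps modulo 2^64.
constexpr std::uint64_t hash4_with(std::uint64_t prime,
                                   std::uint32_t a, std::uint32_t b,
                                   std::uint32_t c, std::uint32_t d) {
    std::uint64_t h = a;
    h = h * prime + b;
    h = h * prime + c;
    h = h * prime + d;
    return h;
}
```

Changing the prime moves the easy collisions rather than removing them: with 37, {1, 0, a, b} and {0, 37, a, b} hash alike, while with 1000003 that pair separates and {1, 0, a, b} collides with {0, 1000003, a, b} instead.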
Because hashing can generate collisions, you have to keep the keys in memory anyway in order to detect them. Hash maps and other standard data structures do exactly this in their internal bookkeeping.
As the key is so small, just use the key directly rather than hashing. This will be faster and will ensure no collisions.
I fully agree with Vinko: just compare them all. If you still want a good hashing function, you need to analyse the distribution of your 4 unsigned integers. Then you have to craft your hashing function so that the result is evenly distributed over the whole range of the 32-bit hash value.
A simple example: let's assume that most of the time, the result from each function is in the range 0 to 255. Then you could easily blend the lower 8 bits from each function into your hash. Most of the time you'd find the result directly; only sometimes (when one function returns a larger result) would you have a collision.
To sum up: without information on how the results of the 4 functions are distributed, we can't help you with a good hashing function.
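The low-8-bits example can be sketched as follows. The function name `pack_low_bytes` and the byte order are illustrative assumptions; the point is only that values in the range 0 to 255 map to distinct hashes, while larger values alias onto them:

```cpp
#include <cstdint>

// Sketch: take the low 8 bits of each input and place them side by side
// in one 32-bit word. Inputs that all fit in 0..255 never collide.
constexpr std::uint32_t pack_low_bytes(std::uint32_t a, std::uint32_t b,
                                       std::uint32_t c, std::uint32_t d) {
    return (a & 0xFFu)
         | ((b & 0xFFu) << 8)
         | ((c & 0xFFu) << 16)
         | ((d & 0xFFu) << 24);
}
```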
Why a hash? It seems like a std::set or std::multiset would be better suited to store this kind of output. All you'd need to do is wrap the four integers up in a struct and write a simple compare function.
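A minimal sketch of that suggestion, assuming a struct named `Output` and a helper named `record` (both hypothetical). The comparison is a plain lexicographic ordering built with `std::tie`:

```cpp
#include <cstdint>
#include <set>
#include <tuple>

// Hypothetical wrapper for the four values the function produces.
struct Output {
    std::uint32_t a, b, c, d;
};

// Lexicographic "less than" over the four fields, as std::set requires.
constexpr bool operator<(const Output& lhs, const Output& rhs) {
    return std::tie(lhs.a, lhs.b, lhs.c, lhs.d)
         < std::tie(rhs.a, rhs.b, rhs.c, rhs.d);
}

// Returns true the first time a given output is seen, false for a repeat.
inline bool record(std::set<Output>& seen, const Output& out) {
    return seen.insert(out).second;
}
```

Because the set stores the full 128-bit key, two different outputs can never be mistaken for each other; the cost is 16 bytes per stored entry plus the container's overhead.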
Try using CRC or FNV. FNV is nice because it is fast and has a defined method of folding bits to get "smaller" hash values (e.g. 12-bit, 24-bit, etc.).
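As a rough illustration, here is a sketch of 64-bit FNV-1a applied to the four integers, with xor-folding down to 32 bits. The offset basis and prime are the standard 64-bit FNV constants; feeding each word low byte first, and the function names, are assumptions made for this example:

```cpp
#include <cstdint>

constexpr std::uint64_t kFnvOffsetBasis = 0xcbf29ce484222325ULL;
constexpr std::uint64_t kFnvPrime = 0x100000001b3ULL;

// One FNV-1a step: xor in a byte, then multiply by the FNV prime.
constexpr std::uint64_t fnv1a_step(std::uint64_t h, std::uint8_t byte) {
    return (h ^ byte) * kFnvPrime;
}

// Feed a 32-bit word into the hash one byte at a time, low byte first.
constexpr std::uint64_t fnv1a_word(std::uint64_t h, std::uint32_t w) {
    for (int shift = 0; shift < 32; shift += 8) {
        h = fnv1a_step(h, static_cast<std::uint8_t>(w >> shift));
    }
    return h;
}

// 64-bit FNV-1a over the 16 bytes of the four inputs.
constexpr std::uint64_t fnv1a64(std::uint32_t a, std::uint32_t b,
                                std::uint32_t c, std::uint32_t d) {
    std::uint64_t h = kFnvOffsetBasis;
    h = fnv1a_word(h, a);
    h = fnv1a_word(h, b);
    h = fnv1a_word(h, c);
    h = fnv1a_word(h, d);
    return h;
}

// Xor-fold a 64-bit hash to 32 bits: high half xor low half.
constexpr std::uint32_t fold32(std::uint64_t h) {
    return static_cast<std::uint32_t>(h >> 32) ^ static_cast<std::uint32_t>(h);
}
```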
Also the benefit of generating a 64-bit hash from a 128-bit (4 X 32-bit) number is a bit questionable because as other people have suggested, you could just use the original value as a key in a set. You really want the number of bits in the hash to represent the number of values you originally have. For example, if your dataset has 100,000 4X32-bit values, you probably want a 17-bit or 18-bit hash value, not a 64-bit hash.
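The 17-bit figure is the base-2 logarithm rounded up: 2^17 = 131,072 is the first power of two at or above 100,000. A small sketch of that arithmetic, with the helper name `bits_needed` being hypothetical:

```cpp
#include <cstdint>

// Smallest number of bits b such that 2^b >= n (for n >= 1).
// The bits < 64 check keeps the shift within the width of the type.
constexpr unsigned bits_needed(std::uint64_t n) {
    unsigned bits = 0;
    while (bits < 64 && (std::uint64_t{1} << bits) < n) {
        ++bits;
    }
    return bits;
}
```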
Might be a bit overkill, but consider Boost.Hash. Generates very simple code and good values.
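Boost's entry point for combining several values is `boost::hash_combine`. To stay dependency-free, the sketch below does not call Boost. It hand-rolls a combiner using the mixing formula that older Boost releases documented for `hash_combine`, and it assumes each integer serves as its own hash value; current Boost versions may mix differently. The names `combine` and `hash4_combined` are hypothetical:

```cpp
#include <cstddef>
#include <cstdint>

// Hand-rolled combiner in the style of the classic hash_combine formula:
// seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2)
constexpr std::size_t combine(std::size_t seed, std::uint32_t v) {
    return seed ^ (static_cast<std::size_t>(v) + 0x9e3779b9u
                   + (seed << 6) + (seed >> 2));
}

// Combine the four inputs in order, starting from a zero seed.
constexpr std::size_t hash4_combined(std::uint32_t a, std::uint32_t b,
                                     std::uint32_t c, std::uint32_t d) {
    std::size_t seed = 0;
    seed = combine(seed, a);
    seed = combine(seed, b);
    seed = combine(seed, c);
    seed = combine(seed, d);
    return seed;
}
```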