How many characters of a string should I read to get a good hash?
Here is a little conundrum for you: if you use a hash algorithm like CRC-64, how many bytes of a string do you need to read to calculate a good hash? Let's say all your strings are at least 2 KB long. Then it seems a waste of resources to use the whole string to calculate the hash, but just how many characters do you think is enough? Would just 8 ASCII characters be enough, since that equals 64 bits? Won't using more than 8 ASCII characters just be pointless? I want to know your thoughts on this.
Update:
By a "good hash" I mean the point at which the likelihood of hash collisions cannot be reduced any further by using more bytes to calculate it.
Comments (2)
If you use CRC-64 over 8 bytes or fewer, then there is no point in using CRC-64 at all: just use the 8 bytes "as is". A CRC does not add any value unless the input is longer than the intended output.
As a general rule, if your hash function has an output of n bits, then collisions begin to appear once you have accumulated about 2^(n/2) strings. Put briefly, if you use 64 bits, collisions only become likely once you have hashed on the order of 2^32 (roughly 4 billion) strings. If you get a 160-bit or larger output, then collisions are virtually infeasible (you will encounter far fewer collisions than hardware failures such as the CPU catching fire). This assumes that the hash function is "perfect". If your hash function begins by selecting a few data bytes, then, necessarily, the bytes that you do not select cannot have any influence on the hash output, so you had better select the "good" bytes, and which bytes are good depends entirely on the kind of strings that you are hashing. There is no general rule here.
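To put numbers on the birthday-style rule of thumb above, here is a minimal sketch that uses the standard approximation p ≈ 1 - exp(-k(k-1)/(2N)) for k strings and N = 2^n possible outputs. It assumes an ideal, uniformly distributed hash; the function name `collision_probability` is made up for illustration.

```python
import math


def collision_probability(num_strings: int, hash_bits: int) -> float:
    """Approximate chance of at least one collision among num_strings
    values drawn uniformly from 2**hash_bits possible hash outputs.

    Assumption: the hash behaves like an ideal random function.
    """
    space = 2.0 ** hash_bits
    # Birthday approximation: p ~= 1 - exp(-k(k-1) / (2N))
    exponent = -num_strings * (num_strings - 1) / (2.0 * space)
    # expm1 keeps precision when the probability is tiny
    return -math.expm1(exponent)


for k in (10**6, 10**9, 2**32):
    p = collision_probability(k, 64)
    print(f"{k:>13,d} strings, 64-bit hash: p = {p:.3g}")
```

With a 64-bit output the probability stays negligible for a few million strings and reaches roughly 0.39 at 2^32 strings, which is where the 2^(n/2) estimate comes from.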
My advice would be to first try using a generic hash function over the whole string; I usually recommend MD4. MD4 is a cryptographic hash function which has been utterly broken, but for a problem with no security involved it is still very good at mixing data elements (cryptographically speaking, a CRC is far more broken than MD4). MD4 has been reported to actually be faster than CRC-32 on some platforms, so you could give it a shot. On a basic PC (my 2.4 GHz Core2), an MD4 implementation runs at about 700 MB/s, so we are talking about roughly 350,000 hashed 2 kB strings per second, which is not bad.
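If you want to check the cost of hashing whole 2 kB strings on your own machine, a rough timing sketch along these lines may help. It is not the benchmark quoted above: MD4 may not be available in a given `hashlib` build, so this sketch substitutes `hashlib.blake2b` with an 8-byte (64-bit) digest, next to `zlib.crc32`. The helper name `hashes_per_second` and the payload are made up for illustration.

```python
import hashlib
import time
import zlib


def hashes_per_second(hash_fn, payload: bytes, rounds: int = 20000) -> float:
    """Time `rounds` calls of hash_fn(payload) and return calls per second."""
    start = time.perf_counter()
    for _ in range(rounds):
        hash_fn(payload)
    elapsed = time.perf_counter() - start
    return rounds / elapsed


# Hypothetical 2 kB input (2048 bytes)
payload = bytes(range(256)) * 8

rates = {
    "crc32 (zlib)": hashes_per_second(zlib.crc32, payload),
    # Stand-in for MD4: BLAKE2b truncated to a 64-bit digest
    "blake2b-64": hashes_per_second(
        lambda data: hashlib.blake2b(data, digest_size=8).digest(), payload
    ),
}

for name, rate in rates.items():
    print(f"{name}: about {rate:,.0f} hashes of 2 kB per second")
```

The absolute numbers depend heavily on the CPU and on interpreter overhead, so treat them only as an order-of-magnitude check.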
What are the chances that the first 8 letters of two different strings are the same? Depending on what these strings are, it could be very high, in which case you'll definitely get hash collisions.
Hash the whole thing. A few kilobytes is nothing. Unless you actually have a need to save nanoseconds in your program, not hashing the full strings would be premature optimization.
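The effect of a shared prefix is easy to see in a small experiment. The sketch below uses `zlib.crc32` as a stand-in for CRC-64 (which is not in the Python standard library) and made-up documents that all begin with the same header; the helper names are hypothetical.

```python
import zlib


def prefix_hash(data: bytes, prefix_len: int = 8) -> int:
    """Hash only the first prefix_len bytes (the shortcut from the question)."""
    return zlib.crc32(data[:prefix_len])


def full_hash(data: bytes) -> int:
    """Hash every byte of the input."""
    return zlib.crc32(data)


# Hypothetical inputs: 1000 distinct documents of about 2 kB each,
# all starting with the same header and differing only in a 4-digit id.
docs = [
    b"<html><head>" + f"{i:04d}".encode() + b"x" * 2048
    for i in range(1000)
]

distinct_prefix = len({prefix_hash(d) for d in docs})
distinct_full = len({full_hash(d) for d in docs})

print("distinct hashes from first 8 bytes:", distinct_prefix)
print("distinct hashes from whole string: ", distinct_full)
```

Because every document starts with the same 8 bytes, the prefix-only hash maps all of them to a single value, while hashing the whole string keeps them apart.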