Good hash function? (32-bit is too small, 64-bit is too large)
I need to generate a hash value to serve as a unique key for many billions of records in Java. Trouble is, I only have 16 numeric digits to play with. In researching this, I have found algorithms for 32-bit hashes, which return Java integers. But this is too small, as it only has a range of about ±2 billion, and I will have more records than that. I cannot go to a 64-bit hash, as that will give me numeric values that are too large (about ±9.2 quintillion, or up to 19 digits). I am dealing with a legacy system that is forcing me into a static key length of 16 digits.
Suggestions? I know no hash function will guarantee uniqueness, but I need a good one that will fit into these restrictions.
Thanks
Comments (5)
If you are limited to 16 decimal digits, your key space contains 10^16 values.
Even if you find a hash that gives a uniform distribution on your data set, the birthday paradox means you will have about a 50% chance of at least one collision at roughly 10^8 items, which is at least an order of magnitude fewer than your billions of records.
This means that you cannot use any kind of hash alone and rely on uniqueness.
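To put numbers on the birthday bound, here is a rough sketch using the standard approximation p ≈ 1 - exp(-n(n-1)/(2N)) for n uniformly distributed keys in a space of N values. The class and method names are made up for illustration, and the formula is an approximation, not an exact probability.

```java
final class BirthdayBound {
    /**
     * Approximate probability of at least one collision when n keys are
     * drawn uniformly at random from a key space of the given size.
     */
    static double collisionProbability(double n, double keySpace) {
        // -expm1(-x) computes 1 - e^(-x) without losing precision for small x
        return -Math.expm1(-n * (n - 1) / (2.0 * keySpace));
    }

    /** Approximate number of keys at which the collision probability reaches p. */
    static double itemsForProbability(double p, double keySpace) {
        return Math.sqrt(2.0 * keySpace * Math.log(1.0 / (1.0 - p)));
    }
}
```

With a key space of 1e16, itemsForProbability(0.5, 1e16) comes out near 1.18e8, and collisionProbability(1e9, 1e16) is indistinguishable from 1, so at a billion records a collision is practically certain.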
A straightforward solution is to use a global counter instead. If a global counter is infeasible, counters with preallocated ranges can be used. For example, the 6 most significant digits denote a fixed data source index, and the 10 least significant digits contain a monotonic counter maintained by that data source.
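The 6 + 10 digit split could look something like the sketch below. SourceKeyGenerator is a hypothetical name, and the in-memory AtomicLong is only an assumption for illustration: a real data source would have to persist its counter so that values are never reused after a restart.

```java
import java.util.concurrent.atomic.AtomicLong;

final class SourceKeyGenerator {
    private static final int SOURCE_LIMIT = 1_000_000;          // 10^6 possible sources
    private static final long COUNTER_LIMIT = 10_000_000_000L;  // 10^10 keys per source

    private final long prefix;
    // Assumption: in-memory only; persist this in a real system.
    private final AtomicLong counter = new AtomicLong();

    SourceKeyGenerator(int sourceIndex) {
        if (sourceIndex < 0 || sourceIndex >= SOURCE_LIMIT) {
            throw new IllegalArgumentException("source index must fit in 6 digits");
        }
        // Shift the source index into the 6 most significant decimal digits.
        this.prefix = sourceIndex * COUNTER_LIMIT;
    }

    /** Returns the next key as a zero-padded 16-digit decimal string. */
    String nextKey() {
        long n = counter.getAndIncrement();
        if (n >= COUNTER_LIMIT) {
            throw new IllegalStateException("counter range exhausted for this source");
        }
        return String.format("%016d", prefix + n);
    }
}
```

Keys from different sources can never collide because their prefixes differ, and keys from the same source are unique as long as the counter never repeats.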
If your generated hash is too large, you can reduce it modulo the size of your key space (10^16 for 16 decimal digits) to make it fit.
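A minimal sketch of that reduction, assuming the 64-bit hash arrives as a Java long (DecimalKeyFold and toKey are illustrative names). Math.floorMod is used instead of % because a Java long hash can be negative and % would then return a negative remainder. Folding does nothing about collisions, and the result is very slightly biased because 2^64 is not a multiple of 10^16.

```java
final class DecimalKeyFold {
    static final long KEY_SPACE = 10_000_000_000_000_000L; // 10^16

    /** Folds any 64-bit hash into a zero-padded 16-digit decimal key. */
    static String toKey(long hash64) {
        // floorMod always returns a value in [0, KEY_SPACE) for a positive modulus
        return String.format("%016d", Math.floorMod(hash64, KEY_SPACE));
    }
}
```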
So your restriction is 53 bits?
As I understand it, the bits of a good hash code are independent of one another, so which bits you keep does not matter. You could therefore take a 64-bit hash function and use only the lowest 53 bits of its output. Use a bitwise operation for this ( hash64 & ((1L << 53) - 1) ), not arithmetic. The 1L and the inner parentheses matter in Java: subtraction binds tighter than <<, and shifting an int 1 would wrap the shift distance.
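For reference, a small sketch of the masking step (HashMask and low53 are names invented here). The largest 53-bit value is 9007199254740991, which has 16 decimal digits, so every masked hash fits the key format.

```java
final class HashMask {
    // Low 53 bits set: 2^53 - 1 = 9007199254740991
    static final long MASK_53 = (1L << 53) - 1;

    /** Keeps the lowest 53 bits of a 64-bit hash; the result is never negative. */
    static long low53(long hash64) {
        return hash64 & MASK_53;
    }
}
```

The result can be padded to a fixed width with String.format("%016d", HashMask.low53(h)). Note that a 53-bit space is slightly smaller than 10^16, so collisions become a little more likely, not less.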
You don't have to store your hashes in a human-readable form (hex, as you said). Just store the 64-bit long value (generated by a 64-bit hash function) in your database; it takes only 8 bytes, not the 19 bytes that scared you off.
If that isn't a solution, improve the legacy system.
Edit: Wait!
64-bit: 2^64 = 18446744073709551616
16 hex digits: 16^16 = 18446744073709551616
Exact fit! So make a hex representation of your 64-bit hash, and there you are.
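A sketch of the hex round trip, assuming the legacy field accepts the letters a-f as well as digits (HexKey and its method names are illustrative). Negative longs format as their unsigned two's-complement value, so every 64-bit hash yields exactly 16 characters.

```java
final class HexKey {
    /** Formats a 64-bit hash as exactly 16 lowercase hex characters. */
    static String toHex(long hash64) {
        return String.format("%016x", hash64);
    }

    /** Parses a 16-character hex key back into the original 64-bit value. */
    static long fromHex(String key) {
        return Long.parseUnsignedLong(key, 16);
    }
}
```

If the field really is digits-only, this does not apply and the key has to stay within 10^16 values.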
If you can store 16 alphanumeric characters, then you can use a hexadecimal representation and pack all 64 bits into 16 characters: 16^16 is 2^64.