Is there a difference in collision rates between one 32-bit hash and two 16-bit hashes?
I am working on a system where hash collisions would be a problem. Essentially, there is a system that references items in a hash-table+tree structure. However, the system in question first compiles text files containing paths in the structure into a binary file containing the hashed values instead; this is done for performance reasons. Because of this, collisions are very bad: the structure cannot store two items with the same hash value, and the part asking for an item would not have enough information to know which one it needs.
My initial thought was that two hashes, either from two different algorithms or from the same algorithm run twice with two different salts, would be more collision-resistant: it would be very unlikely for two items to have the same hash under both hashing algorithms.
I was hoping to keep the hash value at 32 bits for space reasons, so I thought I could switch to using two 16-bit algorithms instead of one 32-bit algorithm. But that would not increase the range of possible hash values...
I know that switching to two 32-bit hashes would be more collision-resistant, but I am wondering whether switching to two 16-bit hashes has at least some gain over a single 32-bit hash. I am not the most mathematically inclined person, so I do not even know how to begin checking for an answer other than to brute-force it...
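A minimal sketch of that brute-force check, scaled down to one 16-bit hash versus two salted 8-bit hashes so that collisions show up quickly (the MD5 slices, salts, and item names are stand-ins for illustration, not the real system's algorithm):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    # One 16-bit hash: the first two bytes of an MD5 digest.
    sub h16 { unpack("n", md5($_[0])) }

    # Two independent 8-bit hashes (different salts) packed into 16 bits.
    sub h8x2 {
        my ($s) = @_;
        my $a = unpack("C", md5("salt1:$s"));
        my $b = unpack("C", md5("salt2:$s"));
        return ($a << 8) | $b;
    }

    my ($trials, $n) = (10_000, 100);   # 100 items per level, many levels
    my ($hits1, $hits2) = (0, 0);
    for my $t (1 .. $trials) {
        my (%seen1, %seen2);
        my ($c1, $c2) = (0, 0);
        for my $i (1 .. $n) {
            my $name = "item-$t-$i";    # distinct made-up names
            $c1 = 1 if $seen1{ h16($name) }++;
            $c2 = 1 if $seen2{ h8x2($name) }++;
        }
        $hits1 += $c1;
        $hits2 += $c2;
    }
    printf "one 16-bit hash : %d of %d levels collided\n", $hits1, $trials;
    printf "two 8-bit hashes: %d of %d levels collided\n", $hits2, $trials;

If the two schemes really are equivalent, both counts should come out roughly equal.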
Some background on the system:
Items are given names by humans; they are not random strings, and will typically be made of words, letters, and numbers with no whitespace. It is a nested hash structure, so if you had something like { a => { b => { c => 'blah' }}} you would get the value 'blah' by getting the value of a/b/c; the compiled request would be 3 hash values in immediate sequence: the hash values of a, b, and then c.
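For illustration, a rough sketch of that compilation step (the 32-bit hash here is just the first four bytes of an MD5 digest, a stand-in for the real algorithm):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    # Stand-in 32-bit hash: the first four bytes of an MD5 digest.
    sub hash32 { unpack("N", md5($_[0])) }

    # Compile a path like "a/b/c" into its sequence of 32-bit hash values.
    sub compile_path {
        my ($path) = @_;
        return map { hash32($_) } split m{/}, $path;
    }

    my @compiled = compile_path("a/b/c");   # three 32-bit values, in order
    printf "%08x %08x %08x\n", @compiled;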
There is only a problem when there is a collision on a given level. A collision between an item at the top level and one at a lower level is fine. You can have { a => { a => {...} }}, which guarantees identical hashes on different levels (not a problem).
In practice any given level will likely have fewer than 100 values to hash, and none will be duplicates on the same level.
To test the hashing algorithm I adopted (I forget which one, but I did not invent it), I downloaded the entire list of CPAN Perl modules, split all namespaces/modules into unique words, and finally hashed each one searching for collisions; I encountered 0 collisions. That means that the algorithm has a different hash value for each unique word in the CPAN namespace list (or that I did it wrong). That seems good enough to me, but it's still nagging at my brain.
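In case it helps, a rough sketch of that test (the word-list filename is hypothetical, and hash32 is again an MD5-based stand-in rather than the algorithm I actually used):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    # Stand-in 32-bit hash: the first four bytes of an MD5 digest.
    sub hash32 { unpack("N", md5($_[0])) }

    # Hash every word from a list (one word per line), reporting any
    # pair of distinct words that share a 32-bit hash value.
    my %by_hash;
    my $collisions = 0;
    open my $fh, '<', 'cpan_words.txt' or die "cannot open word list: $!";
    while (my $word = <$fh>) {
        chomp $word;
        my $h = hash32($word);
        if (defined $by_hash{$h} && $by_hash{$h} ne $word) {
            warn "collision: '$word' and '$by_hash{$h}' share hash $h\n";
            $collisions++;
        }
        $by_hash{$h} //= $word;
    }
    close $fh;
    print "$collisions collisions\n";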
1 Answer
If you have two 16-bit hashes that produce uncorrelated values, then you have just written a 32-bit hash algorithm. It will not be better or worse than any other 32-bit hash algorithm.
If you are concerned about collisions, be sure that you are using a hash algorithm that does a good job of hashing your data (some are written merely to be fast to compute; this is not what you want), and increase the size of your hash until you are comfortable.
This raises the question of the probability of collisions. It turns out that if you have n things in your collection, there are n*(n-1)/2 pairs of things that could collide. If you're using a k-bit hash, the odds of a single pair colliding are 2^-k. If you have a lot of things, then the odds of different pairs colliding are almost uncorrelated. This is exactly the situation that the Poisson distribution describes. Thus the number of collisions that you see should approximately follow a Poisson distribution with

    λ = n*(n-1) / 2^(k+1)

From that, the probability of no hash collisions is about e^-λ. With 32 bits and 100 items, the odds of a collision in one level are about 1.1525 in a million. If you do this enough times, with enough different sets of data, eventually those one-in-a-million chances will add up.

But note that if you have many normal-sized levels and a few large ones, the large ones will have a disproportionate impact on your risk of collision. That is because each thing you add to a collection can collide with any of the preceding things - more things means a higher risk of collision. So, for instance, a single level with 1000 data items has about 1 chance in 10,000 of failing - which is about the same risk as 100 levels with 100 data items each.
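To make that arithmetic concrete, here is a minimal sketch that evaluates the estimate for a few sizes (the function name is mine, purely for illustration):

    use strict;
    use warnings;

    # Poisson estimate from above: lambda = n*(n-1)/2^(k+1), so the chance
    # of at least one collision among n items under a k-bit hash is about
    # 1 - e^-lambda.
    sub collision_odds {
        my ($n, $k) = @_;
        my $lambda = $n * ($n - 1) / 2 ** ($k + 1);
        return 1 - exp(-$lambda);
    }

    printf "n=100,  k=32: %.3g\n", collision_odds(100, 32);    # ~1.15e-06
    printf "n=1000, k=32: %.3g\n", collision_odds(1000, 32);   # ~1.16e-04
    printf "n=100,  k=16: %.3g\n", collision_odds(100, 16);    # ~0.0728

The last row also shows why a single 16-bit hash per level would be hopeless at this scale.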
If the hashing algorithm is not doing its job properly, your risk of collision will go up rapidly. How rapidly depends very much on the nature of the failure.
Using those facts and your projections for how your application will be used, you should be able to decide whether you're comfortable with the risk from 32-bit hashes, or whether you should move up to something larger.