对哈希表中使用的字符串进行哈希处理(双重哈希)
我正在尝试使用双重哈希将字符串键哈希到哈希表中。我做了类似的事情:
protected int getIndex(String key) {
int itr = 0,
size = this.values.length,
index1,
index2,
index = 0;
do {
// do double hashing to get index for curr [itr] (iteration)
index1 = Math.abs(key.hashCode()) % size;
index2 = size - ((key + key + "#!@").hashCode() % size); # trying very hard to eliminate clash, but still fails ... TA and AT gets index 2 when size = 5
index = (index1 + (itr * index2)) % size;
// if itr > set threshold, exit
itr++;
if (itr > 200) {
index = -1;
break;
}
// once index found, exit loop
} while (index > 0 && this.keys[index] != null && !this.keys[index].equals(key));
return index;
}
主要部分是 do
之后的第 1 3 行。我可以说如果我使用双重哈希,它应该消除碰撞的可能性吗? size
是我的哈希表的唯一键的总可能值
I am trying to use Double Hashing to hash a String key into a hash table. I did something like:
protected int getIndex(String key) {
int itr = 0,
size = this.values.length,
index1,
index2,
index = 0;
do {
// do double hashing to get index for curr [itr] (iteration)
index1 = Math.abs(key.hashCode()) % size;
index2 = size - ((key + key + "#!@").hashCode() % size); # trying very hard to eliminate clash, but still fails ... TA and AT gets index 2 when size = 5
index = (index1 + (itr * index2)) % size;
// if itr > set threshold, exit
itr++;
if (itr > 200) {
index = -1;
break;
}
// once index found, exit loop
} while (index > 0 && this.keys[index] != null && !this.keys[index].equals(key));
return index;
}
Main part is the 1st 3 lines after the do
. Can I say if I use Double Hashing, it should eliminate the probability of collision? size
is total possible values of unique keys for my hash table
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
所以我看到这里发生了两件事
乍一看,这两种方法似乎都是好方法以减少哈希冲突。然而,经过仔细检查,这两者都陷入了真正的算法问题。
组合两个哈希
哈希算法被设计为在整数范围内分布得相当好。就像将两个随机数加在一起不会给你带来更多的随机性一样,将两个哈希值加在一起也不会给你带来更多的分布式结果。事实上,将两个相同的分布加在一起总是会得到分布不太均匀的东西。因此,使用相同底层算法的任何类型的双哈希策略都比单哈希策略更糟糕。
尝试新地点
如果第一个哈希值发生冲突,尝试使用新哈希值的算法是很诱人的。然而,这会导致算法的检索部分出现问题。当您将某些内容放入哈希中时,它会碰撞到另一个位置。然后当你去检索该值时,它不在那里。更糟糕的是,您是否找到它取决于第一个元素是否仍然存在。如果它已被删除,则无法判断您正在寻找的项目是否在更远的地方,或者它是否不在那里。最终,.contains 测试必须经过所有 200 次迭代才能确定它正在查找的哈希值不存在。
最好的解决方案是使用 Java 提供的开箱即用的哈希值。如果发生大量冲突,最好在哈希中使用较低的负载因子。这会增加桶的数量,并减少发生冲突的可能性。
So I see two things going on here
At first blush, it looks like both of these are a good way to reduce hash collisions. However, on closer inspection, both of these fall into real algorithmic trouble.
Combining two hashes
Hashing algorithms are designed to be fairly well distributed across the integer spectrum. Just like how adding two random numbers together doesn't give you anything more randomer, adding two hashes together doesn't get you something more distributed. In fact, adding two idential distributions together will ALWAYS give you something less evenly distributed. Thus any kind of double hashing strategy, using the same underlying algorithm, is worse than a single hashing strategy.
Trying a new spot
It's tempting to try an algorithm that tries a new hash if the first one collides. However, this causes problems with the retrieval part of the algorithm. When you put something in the hash, and it bumps to the another spot. Then when you go to retrieve the value, it's not there. Worse yet, whether or not you even find it depends on if the first element is still there or not. If it's been removed, then it's impossible to tell if item you're looking for is further along, or if it just isn't there. Ultimately a .contains test would have to go through all 200 iterations before it could be sure that the hash it's looking for isn't there.
The best solution is to use the out of the box hash provided by Java. If you're getting a lot of collisions, the best thing to use a lower load factor in the hash. This increases the number of buckets, and causes collisions to be less likely.