Supposing simple uniform hashing, that is, any given value is equally likely to hash into any of the slots of the table: why is it better to use a table of size 127 and not 128? I really don't understand what the problem is with powers of 2, or how it makes any difference at all.
"When using the division method, we usually avoid certain values of m (table size). For example, m should not be a power of 2, since if m = 2^p, then h(k) is just the p lowest-order bits of k."
Let's suppose the possible elements are only between 1 and 10000 and I picked the table size as 128. How can 127 be better?
So 128 is 2^7 (10000000) and 127 is 01111111. What difference does this make? All numbers (when hashed) are still going to be the p lowest-order bits of k for 127 too. Did I get something wrong?
I'm looking for some examples, as I really can't understand why this is bad. Thanks a lot in advance!
PS: I am aware of:
Hash table: why size should be prime?
That is wrong (or I misunderstood...).
k % 127 depends on all bits of k. k % 128 depends only on the 7 lowest bits.
EDIT:
If you have a perfect distribution between 1 and 10,000, then reducing the keys % 127 or % 128 will both turn it into an excellent smaller distribution. All buckets will contain 10,000 / 128 = 78 (or 79) items.
If you have a distribution between 1 and 10,000 that is biased because {x, 2x, 3x, ...} occur more often, then a prime size will give a much, much better distribution, as explained in this answer. (Unless x is exactly that prime size, or a multiple of it.)
Thus, cutting off the high bits (using a size of 128) is no problem whatsoever if the distribution in the lower bits is good enough. But with real data and real, badly designed hash functions, you will need those high bits.
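To make the biased case concrete, here is a minimal Python sketch. The choice x = 16 and the helper name bucket_counts are my own assumptions for illustration; the keys are simply every multiple of 16 between 1 and 10,000.

```python
from collections import Counter

def bucket_counts(keys, m):
    """Count how many keys land in each slot under h(k) = k % m."""
    return Counter(k % m for k in keys)

# Hypothetical biased input: only multiples of x = 16 in the range 1..10,000.
x = 16
keys = range(x, 10_001, x)              # 625 keys

counts_128 = bucket_counts(keys, 128)
counts_127 = bucket_counts(keys, 127)

# Number of slots actually used, and the size of the fullest slot.
print(len(counts_128), max(counts_128.values()))   # 8 79
print(len(counts_127), max(counts_127.values()))   # 127 5
```

With 128 slots the 625 keys pile into just 8 slots, because 16 and 128 share a common factor. With 127 slots every slot is used and none holds more than 5 keys.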
Division Method
To understand why m = 2^p uses only the p lowest bits of k, you must first understand the modulo hash function h(k) = k % m.
The key can be written in terms of a quotient q and a remainder r:
k = n*q + r
Choosing the quotient to be q = m allows us to write k % m simply as the remainder in the above equation:
k % m = r = k - n*m, where 0 <= r < m
Therefore, k % m is equivalent to repeatedly subtracting m a total of n times (until r < m):
k % m = k - m - m - ... - m (n subtractions, until r < m)
Let's try hashing the key k = 91 with m = 2^4 = 16:
91 is 0101 1011 in binary; subtracting 16 (0001 0000) five times leaves 11, which is 0000 1011.
Thus, 91 % 2^4 = 11 is just the binary form of 91 with only the p = 4 lowest bits remaining.
Important Distinction:
This pertains specifically to the division method of hashing. In fact, the converse is true for the multiplication method: CLRS notes that there the value of m is not critical, and that it is typically chosen to be a power of 2.
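The k = 91, m = 16 example can be checked with a short Python sketch. The function name mod_by_subtraction is hypothetical and only mirrors the "keep subtracting m" description above; it is not how anyone would compute a remainder in practice.

```python
def mod_by_subtraction(k, m):
    """Compute k % m for non-negative k by subtracting m until the rest is below m."""
    r = k
    while r >= m:
        r -= m
    return r

p = 4
m = 2 ** p               # 16
k = 91                   # 0b1011011

r = mod_by_subtraction(k, m)
low_bits = k & (m - 1)   # bit mask that keeps only the p lowest bits of k

print(r, low_bits)       # 11 11
print(bin(k), bin(r))    # 0b1011011 0b1011
```

Both routes give 11: subtracting 16 never touches the four lowest bits, so they are all that is left once the remainder drops below 16.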
Nick is right that in general, the hash table size doesn't matter. However, in the special case where open addressing with double hashing is used (in which the interval between probes is computed by another hash function) then a prime number-sized hash table is best to ensure that all hash table entries are available for a new element (as Corkscreewe mentioned.)
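A rough Python sketch of why this matters for double hashing, assuming the usual probe sequence (h1 + i*h2) % m. The helper names and the sample values h1 = 5 and h2 = 64 or 48 are made up for illustration.

```python
from math import gcd

def probe_sequence(h1, h2, m):
    """Slots visited by double hashing: (h1 + i*h2) % m for i = 0..m-1."""
    return [(h1 + i * h2) % m for i in range(m)]

def slots_reached(h1, h2, m):
    """Number of distinct slots the probe sequence can ever reach."""
    return len(set(probe_sequence(h1, h2, m)))

# Table size 128: a step that shares a factor with 128 reaches only a few slots.
print(slots_reached(5, 64, 128))                        # 2
print(slots_reached(5, 48, 128), 128 // gcd(48, 128))   # 8 8

# Table size 127 (prime): every step from 1 to 126 reaches all 127 slots.
print(all(slots_reached(0, step, 127) == 127 for step in range(1, 127)))   # True
```

The number of reachable slots is m / gcd(h2, m), so with a prime m any step that is not a multiple of m reaches the whole table.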
First off, it's not about picking a prime number. For your example, if you know your data set will be in the range 1 to 10,000, picking 127 or 128 won't make a difference, because either one is a poor design choice.
Rather, it's better to pick a much larger prime, like 3967 for your example, so that far fewer keys have to share a bucket. You just want to minimize collisions. Picking 127 or 128 for your example won't make a difference because, with up to 10,000 keys, all 127/128 buckets will be filled uniformly with about 78 keys each (this is bad: every insertion and lookup has to scan a long chain, so the run time degrades from O(1) toward O(n)), as opposed to 3967 buckets, which hold only two or three keys each (and so preserve the O(1) run times).
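As a quick check of the bucket sizes, here is a small Python sketch. It assumes all 10,000 keys are present and spread evenly, and the function name is my own.

```python
def average_chain_length(num_keys, m):
    """Load factor: average number of keys per bucket when keys spread evenly."""
    return num_keys / m

for m in (127, 128, 3967):
    print(m, round(average_chain_length(10_000, m), 1))
# 127 78.7
# 128 78.1
# 3967 2.5
```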
If you have a perfect hash function that has an even distribution, then it doesn't matter.
Wikipedia actually has a good summary of this:
http://en.wikipedia.org/wiki/Hash_table
They point out that some hash functions are designed to operate ONLY with prime numbers. This article explains why powers of two are bad:
http://www.concentric.net/~Ttwang/tech/primehash.htm
I cannot prove it anymore, although I remember having to do so in an exam at university a million years ago, but optimal hash sizes are not merely prime. You want to pick a prime number N such that
N = 4*M − 1
(where M is also an integer). That makes 31 a better number of buckets than 29. M is 8 when N is 31, but there is no integral M when N is 29.
As I said, I no longer remember the math to prove this. It was in a theory course taught by Rachel Manber, Udi’s wife, about 25 years ago or so.
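For anyone who wants to try table sizes of that form, here is a small Python sketch that finds primes N with N = 4*M − 1 (equivalently N % 4 == 3). It only tests the form of the number; it says nothing about whether such sizes really hash better. The helper names are my own.

```python
def is_prime(n):
    """Trial-division primality test, fine for small table sizes."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def is_4m_minus_1_prime(n):
    """True if n is prime and n = 4*M - 1 for some integer M, i.e. n % 4 == 3."""
    return is_prime(n) and n % 4 == 3

print(is_4m_minus_1_prime(31), is_4m_minus_1_prime(29))   # True False

candidates = [n for n in range(2, 130) if is_4m_minus_1_prime(n)]
print(candidates)
```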
Here is a way to understand "k % 127 depends on all bits of k. k % 128 only depends on the 7 lowest bits."
k % 128 is equal to k & (2^7 - 1). For example, 129 % 128 = 1; in binary, 1000 0001 & 0111 1111 = 0000 0001. Every bit of (2^7 - 1) above the 7th is 0, which means it does not matter what the high bits of k are. But this rewriting is invalid for moduli that are not of the form 2^n.
Now let's take a look at how we do the division 129 % 127 in decimal. First look at the highest digit, 1, which is less than 127. Then bring down the next digit, 2, and combine it with the first to get 12, which is still less than 127. Then bring down the 9 to get 129, which divided by 127 leaves remainder 2. We can write this as 129 = 1 * 127 + 2, so the result is 2. (This procedure is called long division.) It works the same way in binary, so k % 127 depends on all bits of k.
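A small Python sketch of the same point, using made-up keys that share their 7 lowest bits (all equal to 5) and differ only in the higher bits:

```python
# Hypothetical keys: identical 7 lowest bits, different high bits.
keys = [5 + 128 * i for i in range(10)]

mod_128 = [k % 128 for k in keys]
masked = [k & (2**7 - 1) for k in keys]   # same result as k % 128
mod_127 = [k % 127 for k in keys]

print(mod_128)   # [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
print(masked)    # [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
print(mod_127)   # [5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
```

With 128 the high bits are thrown away and all ten keys collide; with 127 the high bits change the remainder, so the keys spread over ten different slots.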
From Why hash tables should use a prime-number size.