Why do hash tables have constant average access time?
I don't understand this explanation, which says that if n is the number of elements in the hash table and m is the total number of buckets, then hash tables have constant average access time only if n is proportional to m (i.e. m is in Θ(n)). Why does it have to be proportional?
Comments (5)
Well, actually, m should be proportional to n. Otherwise you could, for example, have just one bucket, and it would be just like an unsorted set.
To be more precise, if m is proportional to n, i.e. m = c * n, then the number of items in each bucket will be n/m = 1/c which is a constant. Going to any bucket is an O(1) operation (just compute the hash code) and then the search through the bucket is constant order (you could just do a linear search through the items in the bucket which would be a constant).
Thus the order of the algorithm is O(1), if m = c * n.
To take a converse example, suppose we had a fixed-size table of size tableSize. Then the expected number of items in each bucket is n/tableSize, which is a linear function of n. Any kind of search through the bucket is at best O(log(n)) for a tree (I'm assuming you don't stick another hash table inside the bucket, or we'd then have the same argument about that hash table), so it would not be O(1) in this case.
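To make the "keep m proportional to n" idea concrete, here is a minimal sketch in Python of a separate-chaining table that doubles its bucket count whenever n/m exceeds a fixed load factor. The class name SimpleHashMap and the MAX_LOAD threshold are illustrative choices, not from any particular library, but the resizing strategy is roughly how real implementations keep the expected chain length constant:

```python
# Minimal separate-chaining hash map that keeps m = Θ(n) by resizing
# whenever the load factor n/m exceeds MAX_LOAD (names are illustrative).

class SimpleHashMap:
    MAX_LOAD = 0.75  # upper bound on n/m, so chains stay constant-length on average

    def __init__(self, initial_buckets=8):
        self._buckets = [[] for _ in range(initial_buckets)]
        self._size = 0

    def _index(self, key):
        # Finding the bucket is O(1): compute the hash and take it modulo m.
        return hash(key) % len(self._buckets)

    def put(self, key, value):
        bucket = self._buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite existing key
                return
        bucket.append((key, value))
        self._size += 1
        # Keep m proportional to n: double the bucket count when n/m gets too high,
        # so the expected number of items per bucket remains a constant.
        if self._size / len(self._buckets) > self.MAX_LOAD:
            self._resize(2 * len(self._buckets))

    def get(self, key, default=None):
        # Average O(1): hashing is constant time, and the chain scanned here has
        # constant expected length as long as m grows with n.
        for k, v in self._buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def _resize(self, new_m):
        old_items = [item for bucket in self._buckets for item in bucket]
        self._buckets = [[] for _ in range(new_m)]
        self._size = 0
        for k, v in old_items:
            self.put(k, v)
```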
Strictly speaking, the average-case time complexity of hash table access is actually in Ω(n^(1/3)). Information can't travel faster than the speed of light, which is a constant. Since space has three dimensions, storing n bits of data requires that some of them be located at a distance on the order of n^(1/3) from the CPU. More detail in my blog.
If m doesn't grow along with n, the chance of collisions is higher, and thus the incidence of having to scan through the list of items with the same hash key is also higher.
Access time is constant because access is based on a calculation of a hash value and then a constant lookup to find the appropriate bucket. Assuming the hash function evenly distributes items amongst buckets, then the time it takes to access any individual item will be equal to the time to access other items, regardless of n.
Constant doesn't necessarily mean constantly low, though. The average access time is related to how evenly the hashing function distributes items and to the number of buckets. If you have thousands of items evenly distributed amongst a small number of buckets, you find the bucket fast but then loop through a lot of items in it. If you have a good proportion of buckets to items but a bad hash function that puts many more items in some buckets than in others, access to items in the larger buckets will be slower than access to the others.
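As a rough illustration of that last point, the sketch below (a hypothetical experiment using only Python's built-in hash and collections.Counter) compares bucket occupancy for a well-distributing hash against a degenerate one, with plenty of buckets in both cases:

```python
# Even with plenty of buckets, a bad hash function piles items into a few of them,
# while a good one keeps every chain short.
from collections import Counter

def bucket_sizes(keys, num_buckets, hash_fn):
    # Count how many keys land in each bucket index.
    return Counter(hash_fn(k) % num_buckets for k in keys)

keys = [f"user-{i}" for i in range(10_000)]
m = 16_384  # plenty of buckets relative to n

good = bucket_sizes(keys, m, hash)              # built-in hash: spreads keys out
bad = bucket_sizes(keys, m, lambda k: len(k))   # degenerate hash: few distinct values

print("good hash, largest bucket:", max(good.values()))  # typically a small constant
print("bad hash, largest bucket:", max(bad.values()))    # thousands of items in one bucket
```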
A reasonably sized hash table, where there are enough slots for every element you store and plenty of extra space, will have the hashing function doing most of the work of choosing slots, with very few collisions where different elements have the same hash. A very crowded hash table would have lots of collisions and would degrade to basically a linear search, where almost every lookup hits a wrong item that had the same hash and you have to keep searching for the right one (a hash table lookup still has to check the key once it picks the first slot, because the key it's looking for might have had a collision when it was stored).
What determines the hit-collision ratio is exactly the ratio of number-of-items to size-of-hash (i.e., the percentage chance that a randomly chosen slot will be filled).
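As a back-of-the-envelope illustration of how that fill ratio drives lookup cost, the sketch below assumes open addressing with linear probing and uses the standard approximation for a successful search, roughly (1 + 1/(1 - α))/2 probes where α is the load factor n/m; it shows how quickly the expected work grows as the table fills up:

```python
# Expected probes per successful lookup under linear probing, as a function of
# the load factor alpha = n/m (standard textbook approximation).
def expected_probes(load_factor):
    return 0.5 * (1 + 1 / (1 - load_factor))

for load in (0.25, 0.5, 0.75, 0.9, 0.99):
    print(f"load factor {load:.2f} -> ~{expected_probes(load):.1f} probes per lookup")
```

The takeaway matches the answer above: as long as the table stays sparsely filled the cost is a small constant, but as the fill percentage approaches 100% the expected number of probes blows up.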