How can I improve the performance of a hash table with 1 million elements and 997 buckets?
This is an interview question.
Suppose that there are 1 million elements in the table and 997 buckets of unordered lists. Further suppose that the hash function distributes keys with equal probability (i.e., each bucket has 1000 elements).
What is the worst case time to find an element which is not in the table? To find one which is in the table? How can you improve this?
My solution:
The worst-case time to find an element, whether it is in the table or not, is O(1000), where 1000 is the length of each unsorted list.
To improve it:
(0) Straightforward: increase the number of buckets to something greater than 1 million.
(1) Each bucket holds a second hash table, which uses a different hash function to compute hash values for that second table. That would be O(1).
(2) Each bucket holds a binary search tree. That would be O(log n).
Is it possible to make a trade-off between space and time, keeping both within a reasonable range?
Any better ideas? Thanks!
4 Answers
The simplest and most obvious improvement would be to increase the number of buckets in the hash table to something like 1.2 million -- at least assuming your hash function can generate numbers in that range (which it typically will).
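As a rough illustration of that idea, here is a minimal Python sketch of a chained hash table that can be rehashed into a larger bucket array; the `ChainedHashTable` name and its methods are made up for this example, not taken from the answer:

```python
# Illustrative sketch: a chained hash table that can be rehashed into a larger
# bucket array. All names here are hypothetical, not from the answer above.

class ChainedHashTable:
    def __init__(self, num_buckets=997):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))

    def lookup(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def resize(self, new_num_buckets):
        # Rebuild with more buckets (e.g. ~1.2 million for 1 million elements),
        # re-inserting every element so the load factor drops below 1.
        old_buckets = self.buckets
        self.buckets = [[] for _ in range(new_num_buckets)]
        for bucket in old_buckets:
            for key, value in bucket:
                self.buckets[self._index(key)].append((key, value))
```

With roughly 1.2 million buckets for 1 million elements, the load factor falls below 1, so the expected chain length is under one element and both hits and misses become O(1) on average, given a well-behaved hash function.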
Obviously, increasing the number of buckets improves the performance. Assuming that is not an option (for whatever reason), I suggest the following:
Normally a hash table consists of buckets, each of which holds a linked list (a pointer to its head). You may, however, create a hash table whose buckets hold a binary search tree (a pointer to its root) rather than a list.
That way you get a hybrid of a hash table and a binary tree. I once implemented such a thing: I had no limitation on the number of buckets in the hash table, but I didn't know the number of elements in advance, and I had no information about the quality of the hash function. Hence, I created a hash table with a reasonable number of buckets, and the remaining ambiguity was resolved by the binary trees.
If N is the number of elements and M is the number of buckets, then in the case of an even distribution the complexity grows as O(log(N/M)).
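A minimal sketch of that hybrid is below. It uses a plain unbalanced BST per bucket for brevity, whereas a balanced tree (e.g. a red-black tree) is what actually guarantees the O(log(N/M)) bound; the class and method names are hypothetical:

```python
# Hypothetical sketch of the hash-table-of-BSTs hybrid described above.
# A plain unbalanced BST is used for brevity; a balanced tree (e.g. red-black)
# is what actually guarantees the O(log(N/M)) bound. Keys must support <.

class TreeNode:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None

class TreeBucketHashTable:
    def __init__(self, num_buckets=997):
        self.roots = [None] * num_buckets   # one BST root per bucket

    def insert(self, key, value):
        i = hash(key) % len(self.roots)
        self.roots[i] = self._insert(self.roots[i], key, value)

    def _insert(self, node, key, value):
        if node is None:
            return TreeNode(key, value)
        if key < node.key:
            node.left = self._insert(node.left, key, value)
        elif key > node.key:
            node.right = self._insert(node.right, key, value)
        else:
            node.value = value              # overwrite an existing key
        return node

    def lookup(self, key):
        node = self.roots[hash(key) % len(self.roots)]
        while node is not None:
            if key < node.key:
                node = node.left
            elif key > node.key:
                node = node.right
            else:
                return node.value
        return None                         # not present
```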
If you can't use another data structure or a larger table, there are still options:
If the active set of elements is closer to 1,000 than 1 million, you can improve the average successful lookup time by moving each element you find to the front of its list. That will allow it to be found quickly when it is reused.
Similarly, if there is a set of misses that occurs frequently, you can cache the negative result (this can just be a special kind of entry in the hash table).
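As a hypothetical sketch of both ideas, the lookup below moves every hit to the front of its bucket and caches misses as sentinel entries (all names are invented for this example):

```python
# Hypothetical sketch: move-to-front within a bucket, plus caching of negative
# lookups as sentinel entries. All names are invented for this example.

_MISS = object()   # sentinel value marking a cached "not present" result

class MTFHashTable:
    def __init__(self, num_buckets=997):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)    # also overwrites a cached miss
                return
        bucket.insert(0, (key, value))

    def lookup(self, key):
        bucket = self._bucket(key)
        for i, (k, v) in enumerate(bucket):
            if k == key:
                # Move the hit to the front so a hot key is found quickly next time.
                bucket.insert(0, bucket.pop(i))
                return None if v is _MISS else v
        # Cache the miss so repeated lookups of the same absent key stay cheap.
        bucket.insert(0, (key, _MISS))
        return None
```

In practice you would bound or expire the cached misses so they cannot grow without limit.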
That doesn't quite add up, but let's run with it....
The worst (and best = only) case for missing elements is that you hash to a bucket then go through inspecting all the elements in that specific list (i.e. 1000) then fail. If they want big-O notation, by definition that describes how performance varies with the number of elements N, so we have to make an assumption about how the # buckets varies with N too: my guess is that the 997 buckets is a fixed amount, and is not going to be increased if the number of elements increases. The number of comparisons is therefore N/997, which - being a linear factor - is still O(N).
Nope - you're thinking of the number of comparisons - but big-O notation is about scalability.
Well yes - the average number of collisions relates to the number of entries and buckets. If you want very few collisions, you'd have well over 1 million entries in the table, but that gets wasteful of memory, though for large objects you can have an index or pointer to the actual object. An alternative is to look for faster collision handling mechanisms, such as trying a series of offsets from the hashed-to bucket (using % to map the displacements back into the table size), rather than resorting to some heap using linked lists. Rehashing is another alternative, giving lower re-collision rates but typically needing more CPU, and having an arbitrarily long list of good hashing algorithms is problematic.
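For what it's worth, a small sketch of that offset-probing idea (open addressing with linear probing) might look like the following; the class name and the 2-million-slot sizing are assumptions for illustration:

```python
# Hypothetical sketch of open addressing: probe a series of offsets from the
# home slot instead of chaining, wrapping around with % as described above.

class OpenAddressingHashTable:
    _EMPTY = object()                          # marker for a never-used slot

    def __init__(self, num_buckets=1 << 21):   # ~2M slots for ~1M elements
        self.keys = [self._EMPTY] * num_buckets
        self.values = [None] * num_buckets

    def _probe(self, key):
        n = len(self.keys)
        home = hash(key) % n
        for step in range(n):
            yield (home + step) % n            # linear probing; other offset series work too

    def insert(self, key, value):
        for slot in self._probe(key):
            if self.keys[slot] is self._EMPTY or self.keys[slot] == key:
                self.keys[slot] = key
                self.values[slot] = value
                return
        raise RuntimeError("hash table is full")

    def lookup(self, key):
        for slot in self._probe(key):
            if self.keys[slot] is self._EMPTY:
                return None                    # reached an empty slot: key absent
            if self.keys[slot] == key:
                return self.values[slot]
        return None
```

Deletion is omitted here; a real open-addressing table needs tombstones (or backward-shift deletion) so that probe sequences are not broken.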
Hash tables within hash tables is totally pointless and remarkably wasteful of memory. Much better to use a fraction of that space to reduce collisions in the outer hash table.