Running time for inserting n elements into an empty hash table
People say it takes amortized O(1) to insert into a hash table, so inserting n elements must be O(n). However, that's not true for large n, since, as one answerer said, "All you need to satisfy expected amortized O(1) is to expand the table and rehash everything with a new random hash function any time there is a collision."
So: what is the average running-time of inserting n elements into a hash table? I realize this is probably implementation-dependent, so mention what type of implementation you're talking about.
For example, if there are (log n) equally spaced collisions, and each collision takes O(k) to resolve, where k is the current size of the hashtable, then you'd have this recurrence relation:
T(n) = T(n/2) + n/2 + n/2
(that is, you take the time to insert n/2 elements, then you have a collision, taking n/2 to resolve, then you do the remaining n/2 inserts without a collision). This still ends up being O(n), so yay. But is this reasonable?
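As a sanity check, here is a minimal counting sketch (Python; the function name insert_all is just for illustration) of the usual growth-by-doubling implementation, which is one common reading of "amortized O(1)". It is not the collision-triggered rehashing scheme quoted above; it only counts how many elements get touched, and that count per insert stays bounded, so n inserts are O(n) under that policy:

    # Illustrative sketch, not any particular library's implementation.
    # Count work done by a table that doubles when the load factor is exceeded.
    # Each rehash touches every stored element, yet the total stays linear in n
    # because capacities grow geometrically.
    def insert_all(n, load_factor=0.75):
        capacity, size, moves = 8, 0, 0
        for _ in range(n):
            if size + 1 > capacity * load_factor:
                capacity *= 2
                moves += size        # a rehash copies every element already stored
            size += 1
            moves += 1               # the insert itself
        return moves

    for n in (1_000, 10_000, 100_000, 1_000_000):
        print(n, insert_all(n), round(insert_all(n) / n, 2))  # ratio stays bounded (~3)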
4 Answers
It completely depends on how inefficient your rehashing is. Specifically, if you can properly estimate the expected size of your hashtable the second time, your runtime still approaches O(n). Effectively, you have to specify how inefficient your rehash size calculation is before you can determine the expected order.
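To make "how inefficient your rehash size calculation is" concrete, here is a small illustrative simulation (the helper name rehash_copies is made up, and this models simple doubling, not any particular library): a decent initial size estimate means almost no copying, while a bad one still costs only a bounded constant factor as long as growth stays geometric.

    # How much rehash copying happens for n inserts, as a function of how good
    # the initial size estimate is.  Illustrative sketch only.
    def rehash_copies(n, initial_capacity, load_factor=0.75):
        capacity, size, copies = initial_capacity, 0, 0
        for _ in range(n):
            if size + 1 > capacity * load_factor:
                capacity *= 2
                copies += size       # a rehash copies every stored element
            size += 1
        return copies

    n = 1_000_000
    for estimate in (8, n // 10, n, 2 * n):
        print(f"initial capacity {estimate:>8}: {rehash_copies(n, estimate):>9} copies")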
From a theoretical standpoint, it is expected amortized O(1).
Hash tables are fundamentally a randomized data structure, in the same sense that quicksort is a randomized algorithm. You need to generate your hash functions with some randomness, or else there exist pathological inputs which are not O(1).
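For instance, here is a toy sketch (the class name and bucket count are made up for illustration) of a chained table with a fixed, publicly known hash function; an adversary can choose keys that all land in one bucket, so each operation degrades from O(1) to O(n):

    # Illustrative sketch: a chained hash table whose hash function is fixed
    # (key mod bucket count).  Keys chosen as multiples of the bucket count
    # all collide into bucket 0.
    class FixedHashTable:
        def __init__(self, n_buckets=1024):
            self.buckets = [[] for _ in range(n_buckets)]

        def insert(self, key):
            self.buckets[key % len(self.buckets)].append(key)

        def contains(self, key):
            return key in self.buckets[key % len(self.buckets)]

    t = FixedHashTable()
    for i in range(10_000):
        t.insert(i * 1024)                    # every key hashes to bucket 0

    print(max(len(b) for b in t.buckets))     # 10000: one bucket holds everything

Randomizing the hash function, much as quicksort randomizes its pivot, is what rules out such pathological inputs in expectation.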
You can achieve expected amortized O(1) using dynamic perfect hashing:
The naive idea I originally posted was to rehash with a new random hash function on every collision. (See also perfect hash functions.) The problem with this is that it requires O(n^2) space, because of the birthday paradox.
The solution is to have two hash tables, with the second table for collisions; resolve collisions on that second table by rebuilding it. That table will have only O(\sqrt{n}) elements, so it would grow to O(n) size.
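Here is a simplified sketch of that two-level idea, assuming an (a*x + b) mod p hash family and quadratically sized per-bucket tables; the sizes, class names, and parameters are illustrative rather than the exact published construction, and duplicate keys are not handled:

    # Illustrative sketch of a two-level ("rebuild on collision") hash table.
    import random

    P = (1 << 61) - 1                  # a Mersenne prime for the hash family

    def random_hash(m):
        # h(x) = ((a*x + b) mod P) mod m, with fresh random a, b
        a, b = random.randrange(1, P), random.randrange(P)
        return lambda x: (a * x + b) % P % m

    class Bucket:
        def __init__(self):
            self.keys, self.table, self.h = [], [], None

        def rebuild(self):
            # Quadratic size makes a collision-free hash easy to find by retrying.
            m = max(1, len(self.keys) ** 2)
            while True:
                self.h, self.table = random_hash(m), [None] * m
                for k in self.keys:
                    i = self.h(k)
                    if self.table[i] is not None:
                        break                     # collision: try another hash
                    self.table[i] = k
                else:
                    return                        # every key placed without collision

        def insert(self, key):
            self.keys.append(key)
            if self.h is None or self.table[self.h(key)] is not None:
                self.rebuild()                    # collision: rehash this bucket
            else:
                self.table[self.h(key)] = key

        def contains(self, key):
            return self.h is not None and self.table[self.h(key)] == key

    class TwoLevelHash:
        def __init__(self, n_buckets):
            self.h = random_hash(n_buckets)       # first level: fixed random split
            self.buckets = [Bucket() for _ in range(n_buckets)]

        def insert(self, key):
            self.buckets[self.h(key)].insert(key)

        def contains(self, key):
            return self.buckets[self.h(key)].contains(key)

    keys = random.sample(range(10**9), 1_000)
    t = TwoLevelHash(1_000)
    for k in keys:
        t.insert(k)
    assert all(t.contains(k) for k in keys)       # a lookup is one hash per level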
In practice you often just use a fixed hash function because you can assume (or don't care if) your input is pathological, much like you often quicksort without prerandomizing the input.
All O(1) is saying is that the operation is performed in constant time, and it's not dependent on the number of elements in your data structure.
In simple words, this means that you'll have to pay the same cost no matter how big your data structure is.
In practical terms this means that simple data structures such as trees are generally more effective when you don't have to store a lot of data. In my experience, I find trees faster up to ~1k elements (32-bit integers); after that, hash tables take over. But as usual, YMMV.
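If you want to see where that crossover falls on your own machine, here is a rough lookup benchmark sketch. Python's standard library has no balanced tree, so binary search on a sorted list (bisect) stands in for a tree's O(log n) lookups, against the built-in dict as the hash table; absolute numbers and the crossover point will certainly differ from compiled-code experience with 32-bit integers.

    # Illustrative benchmark sketch; the helper name bench is made up.
    import bisect, random, timeit

    def bench(n, trials=10_000):
        keys = random.sample(range(n * 10), n)
        probes = [random.choice(keys) for _ in range(trials)]
        d = {k: True for k in keys}          # hash table
        s = sorted(keys)                     # sorted array, stand-in for a tree

        def dict_lookup():
            for p in probes:
                _ = d[p]

        def tree_lookup():
            for p in probes:
                i = bisect.bisect_left(s, p)
                _ = (s[i] == p)

        return (timeit.timeit(dict_lookup, number=10),
                timeit.timeit(tree_lookup, number=10))

    for n in (100, 1_000, 10_000, 100_000):
        d_t, t_t = bench(n)
        print(f"n={n:>7}  dict: {d_t:.4f}s  bisect: {t_t:.4f}s")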
Why not just run a few tests on your system? Maybe if you post the source, we can test it on our systems, and we could really shape this into a very useful discussion.
It is not just the implementation but also the environment that decides how much time the algorithm actually takes. You can, however, check whether any benchmarking samples are available. The problem with me posting my results is that they would be of no use, since people have no idea what else is running on my system, how much RAM is free right now, and so on. You can only ever get a broad idea, and that is about as good as what big-O gives you.