Is a Java hashmap lookup really O(1)?

Posted 2024-07-25 11:24:13

I've seen some interesting claims on SO regarding Java hashmaps and their O(1) lookup time. Can someone explain why this is so? Unless these hashmaps are vastly different from any of the hashing algorithms I was brought up on, there must always exist a dataset that contains collisions.

In which case, the lookup would be O(n) rather than O(1).

Can someone explain whether they are O(1) and, if so, how they achieve this?

Comments (15)

清君侧 2024-08-01 11:24:14

A particular feature of a HashMap is that, unlike, say, balanced trees, its behavior is probabilistic. In such cases it's usually most helpful to talk about complexity in terms of the probability of a worst-case event occurring. For a hash map, that is of course a collision, whose likelihood depends on how full the map happens to be. A collision is pretty easy to estimate:

p_collision = n / capacity

So a hash map with even a modest number of elements is pretty likely to experience at least one collision. Big O notation allows us to do something more compelling, though. Observe that for any arbitrary, fixed constant k:

O(n) = O(k * n)

We can use this feature to improve the performance of the hash map. We could instead think about the probability of at most 2 collisions:

p_2_collisions = (n / capacity)^2

This is much lower. Since the cost of handling one extra collision is irrelevant to Big O performance, we've found a way to improve performance without actually changing the algorithm! We can generalize this to:

p_k_collisions = (n / capacity)^k

And now we can disregard some arbitrary number of collisions and end up with a vanishingly tiny likelihood of more collisions than we are accounting for. You can get the probability down to an arbitrarily tiny level by choosing the correct k, all without altering the actual implementation of the algorithm.

We talk about this by saying that the hash map has O(1) access with high probability.
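
To make the numbers concrete, here is a small, self-contained Java snippet (the map state n = 100, capacity = 256 is purely illustrative) that evaluates the formula above for increasing k:

    // Numeric check of the argument above: the probability of a probe having to
    // step past at least k colliding entries falls off geometrically in k.
    public class CollisionOdds {
        public static void main(String[] args) {
            int n = 100, capacity = 256;        // hypothetical map state
            double p = (double) n / capacity;   // p_collision = n / capacity
            for (int k = 1; k <= 4; k++) {
                System.out.printf("p of %d collisions ~ %.4f%n", k, Math.pow(p, k));
            }
        }
    }

With these numbers the sequence is roughly 0.39, 0.15, 0.06, 0.02: each additional collision we agree to absorb into the constant factor makes the remaining bad event markedly less likely.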

兮子 2024-08-01 11:24:14

You seem to be mixing up worst-case behaviour with average-case (expected) runtime. The former is indeed O(n) for hash tables in general (i.e. not using perfect hashing), but it is rarely relevant in practice.

Any dependable hash table implementation, coupled with a half-decent hash, has a retrieval performance of O(1) with a very small factor (2, in fact) in the expected case, within a very narrow margin of variance.

无力看清 2024-08-01 11:24:14

In Java, how does a HashMap work?

  • Using hashCode to locate the corresponding bucket (inside the internal array of buckets).
  • Each bucket is a LinkedList (or, under some conditions starting from Java 8, a balanced red-black tree) of the items residing in that bucket.
  • The items are scanned one by one, using equals for comparison.
  • When adding more items, the HashMap is resized (doubling the size) once a certain load percentage is reached.

So, sometimes it will have to compare against a few items, but generally it's much closer to O(1) than to O(n) or O(log n).
For practical purposes, that's all you should need to know.
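
To make those steps concrete, here is a deliberately minimal sketch of a chained hash table (a toy class of our own, not the JDK's HashMap; it omits resizing, null handling, and treeification):

    import java.util.LinkedList;

    // Toy chained hash table illustrating the lookup steps described above.
    class SimpleChainedMap<K, V> {
        private static final int CAPACITY = 16;              // fixed for brevity; HashMap resizes
        private final LinkedList<Entry<K, V>>[] buckets;

        @SuppressWarnings("unchecked")
        SimpleChainedMap() {
            buckets = new LinkedList[CAPACITY];
            for (int i = 0; i < CAPACITY; i++) buckets[i] = new LinkedList<>();
        }

        private int indexFor(Object key) {
            return (key.hashCode() & 0x7fffffff) % CAPACITY; // step 1: hashCode -> bucket
        }

        V get(K key) {
            for (Entry<K, V> e : buckets[indexFor(key)]) {   // step 2: scan the bucket
                if (e.key.equals(key)) return e.value;       // step 3: compare with equals
            }
            return null;
        }

        void put(K key, V value) {
            for (Entry<K, V> e : buckets[indexFor(key)]) {
                if (e.key.equals(key)) { e.value = value; return; }
            }
            buckets[indexFor(key)].add(new Entry<>(key, value));
        }

        private static final class Entry<K, V> {
            final K key; V value;
            Entry(K key, V value) { this.key = key; this.value = value; }
        }
    }

As long as the buckets stay short on average, get touches only a handful of entries regardless of the total size.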

深爱成瘾 2024-08-01 11:24:14

Remember that O(1) does not mean that each lookup examines only a single item; it means that the average number of items checked remains constant with respect to the number of items in the container. So if it takes on average 4 comparisons to find an item in a container with 100 items, it should also take an average of 4 comparisons to find an item in a container with 10,000 items, and likewise for any other number of items (there's always a bit of variance, especially around the points at which the hash table rehashes, and when there's a very small number of items).

So collisions don't prevent the container from having O(1) operations, as long as the average number of keys per bucket remains within a fixed bound.
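
A quick way to see this constancy is the following hypothetical experiment (the Key class and its deliberately coarse four-keys-per-hash function are our own contrivance): count equals() comparisons per successful lookup at two very different sizes.

    import java.util.HashMap;
    import java.util.Map;

    // Counts equals() comparisons per lookup; with ~4 keys sharing each hash
    // value, the average should be the same small constant at any size.
    public class ProbeCount {
        static long equalsCalls = 0;

        static final class Key {
            final int id;
            Key(int id) { this.id = id; }
            @Override public int hashCode() { return id / 4; }  // ~4 keys share a hash
            @Override public boolean equals(Object o) {
                equalsCalls++;
                return o instanceof Key && ((Key) o).id == id;
            }
        }

        static double avgComparisons(int n) {
            Map<Key, Integer> map = new HashMap<>();
            for (int i = 0; i < n; i++) map.put(new Key(i), i);
            equalsCalls = 0;                      // ignore comparisons made during put
            for (int i = 0; i < n; i++) map.get(new Key(i));
            return (double) equalsCalls / n;
        }

        public static void main(String[] args) {
            System.out.println("n=100:   avg comparisons = " + avgComparisons(100));
            System.out.println("n=10000: avg comparisons = " + avgComparisons(10000));
            // Both averages should come out around the same small constant.
        }
    }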

人生百味 2024-08-01 11:24:14

I know this is an old question, but there's actually a new answer to it.

You're right that a hash map isn't really O(1), strictly speaking, because as the number of elements gets arbitrarily large, eventually you will not be able to search in constant time (and O-notation is defined in terms of numbers that can get arbitrarily large).

But it doesn't follow that the real time complexity is O(n), because there's no rule that says that the buckets have to be implemented as a linear list.

In fact, Java 8 reimplements a bucket as a balanced red-black tree once it exceeds a threshold, which makes the actual time O(log n).
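
Here is a sketch of that behavior (the key class is our own; it forces every key into one bucket, but because it is Comparable, Java 8+ can treeify the overloaded bin):

    import java.util.HashMap;
    import java.util.Map;

    // Every instance collides, yet lookups stay O(log n) because the single
    // bin is converted to a red-black tree once it grows past the threshold.
    final class ColludingKey implements Comparable<ColludingKey> {
        final int id;
        ColludingKey(int id) { this.id = id; }
        @Override public int hashCode() { return 42; }   // deliberately terrible
        @Override public boolean equals(Object o) {
            return o instanceof ColludingKey && ((ColludingKey) o).id == id;
        }
        @Override public int compareTo(ColludingKey other) {
            return Integer.compare(id, other.id);        // enables the tree bin
        }

        public static void main(String[] args) {
            Map<ColludingKey, Integer> map = new HashMap<>();
            for (int i = 0; i < 100_000; i++) map.put(new ColludingKey(i), i);
            // Walks a tree of depth ~log(100000), not a 100,000-element list.
            System.out.println(map.get(new ColludingKey(99_999)));
        }
    }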

若言繁花未落 2024-08-01 11:24:14

O(1 + n/k), where k is the number of buckets.

If the implementation sets k = n/alpha, then it is O(1 + alpha) = O(1), since alpha is a constant.
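
As a concrete instance (a fact about the JDK defaults, not stated in this answer): java.util.HashMap keeps alpha at most 0.75 by default, doubling the bucket array whenever n exceeds 0.75 * k, so the expected lookup cost stays at O(1 + 0.75) = O(1).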

以可爱出名 2024-08-01 11:24:14

If the number of buckets (call it b) is held constant (the usual case), then lookup is actually O(n).

As n gets large, the number of elements in each bucket averages n/b. If collision resolution is done in one of the usual ways (linked list for example), then lookup is O(n/b) = O(n).

The O notation is about what happens when n gets larger and larger. It can be misleading when applied to certain algorithms, and hash tables are a case in point. We choose the number of buckets based on how many elements we're expecting to deal with. When n is about the same size as b, then lookup is roughly constant-time, but we can't call it O(1) because O is defined in terms of a limit as n → ∞.
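
As a worked instance of this argument: if b is pinned at 16 while n grows to 1,000,000, each bucket averages 1,000,000 / 16 = 62,500 entries, and a lookup scans on the order of n/b of them, i.e. time linear in n. (The numbers are illustrative; java.util.HashMap sidesteps this by growing b along with n.)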

笨笨の傻瓜 2024-08-01 11:24:14

Elements inside the HashMap are stored as an array of linked lists (nodes); each linked list in the array represents a bucket for the unique hash values of one or more keys.
While adding an entry to the HashMap, the hashcode of the key is used to determine the location of the bucket in the array, something like:

location = (arraylength - 1) & keyhashcode

Here the & represents the bitwise AND operator.

For example: 100 & "ABC".hashCode() = 64 (location of the bucket for the key "ABC")

During the get operation, the same computation is used to determine the bucket for the key. In the best case, each key has a unique hashcode and gets a unique bucket; the get method then spends time only on locating the bucket and retrieving the value, which is constant: O(1).

In the worst case, all the keys have the same hashcode and are stored in the same bucket, which means traversing the entire list, leading to O(n).

In the case of Java 8, a linked-list bucket is replaced with a balanced tree if its size grows beyond 8 entries, which reduces the worst-case search cost to O(log n).
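
As a runnable sketch of that computation (the class and method names are ours; the XOR line reflects the extra hash "spreading" step that java.util.HashMap performs in Java 8+, which the simplified formula above omits):

    // Illustrative, not JDK source. HashMap spreads the raw hashcode
    // (h ^ (h >>> 16)) so high bits affect the index, then masks it with
    // (table.length - 1); the mask works because table lengths are powers of two.
    public class BucketIndexDemo {
        static int bucketIndex(Object key, int tableLength) {
            int h = key.hashCode();
            h ^= (h >>> 16);
            return (tableLength - 1) & h;  // equivalent to h mod tableLength here
        }

        public static void main(String[] args) {
            System.out.println(bucketIndex("ABC", 16));  // prints 2 for this input
        }
    }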

We've established that the standard description of hash table lookups as O(1) refers to the average-case expected time, not the strict worst-case performance. For a hash table resolving collisions with chaining (like Java's HashMap), this is technically O(1 + α) with a good hash function, where α is the table's load factor. It stays constant as long as the number of objects you're storing is no more than a constant factor larger than the table size.

It's also been explained that, strictly speaking, it's possible to construct input that requires O(n) lookups for any deterministic hash function. But it's also interesting to consider the worst-case expected time, which is different from the average search time. Using chaining, this is O(1 + the length of the longest chain), for example Θ(log n / log log n) when α = 1.

If you're interested in theoretical ways to achieve constant-time expected worst-case lookups, you can read about dynamic perfect hashing, which resolves collisions recursively with another hash table!

段念尘 2024-08-01 11:24:14

It is O(1) only if your hashing function is very good. The Java hash table implementation does not protect against bad hash functions.

Whether you need to grow the table when you add items is not relevant to the question, because it is about lookup time.
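
For illustration, here is a hypothetical key class of ours showing a reasonable hashCode() next to a degenerate one that HashMap will happily accept:

    import java.util.Objects;

    // A well-behaved key: equals and hashCode agree, and the hash mixes both fields.
    final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
        @Override public boolean equals(Object o) {
            return o instanceof Point && ((Point) o).x == x && ((Point) o).y == y;
        }
        @Override public int hashCode() { return Objects.hash(x, y); }
        // Legal but terrible alternative that HashMap cannot defend against:
        // @Override public int hashCode() { return 1; }   // every Point collides
    }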

少女七分熟 2024-08-01 11:24:14

This basically goes for most hash table implementations in most programming languages, as the algorithm itself doesn't really change.

If there are no collisions present in the table, you only have to do a single look-up, so the running time is O(1). If there are collisions present, you have to do more than one look-up, which drags the performance down towards O(n).

无妨# 2024-08-01 11:24:14

It depends on the algorithm you choose for avoiding collisions. If your implementation uses separate chaining, then the worst-case scenario happens when every data element hashes to the same value (a poor choice of hash function, for example). In that case, data lookup is no different from a linear search on a linked list, i.e. O(n). However, the probability of that happening is negligible, and best- and average-case lookups remain constant, i.e. O(1).

乜一 2024-08-01 11:24:14

Only in the theoretical case where hashcodes are always different, and the bucket for every hash code is also different, will O(1) hold exactly. Otherwise it is of constant order, i.e. as the hashmap grows, its order of search remains constant.

明月夜 2024-08-01 11:24:14

Academics aside, from a practical perspective, HashMaps should be accepted as having an inconsequential performance impact (unless your profiler tells you otherwise).

苍风燃霜 2024-08-01 11:24:14

Of course the performance of the hashmap will depend on the quality of the hashCode() function for the given object. However, if the function is implemented such that the possibility of collisions is very low, it will have very good performance (this is not strictly O(1) in every possible case, but it is in most cases).

For example, the default implementation in the Oracle JRE uses a random number (which is stored in the object header so that it doesn't change, though this also disables biased locking for that object; that's another discussion), so the chance of collisions is very low.
