In a HashMap, how do we get 0.5 as the probability of a bucket being empty?
The official Java documentation says: "Ideally, under random hashCodes, the frequency of nodes in bins follows a Poisson distribution with a parameter of about 0.5 on average for the default resizing threshold of 0.75." I want to know how we get 0.5 for a bucket being empty, and how to prove it mathematically.

Here's a discussion of where the 0.5 probably came from.
The context is the Java HashMap implementation, regarding the point at which a bucket should be converted ("treeified") from a linked list to a red-black tree. This should occur extremely rarely under typical operation. What's a good value to choose that gives "extremely rarely"? A comment in the HashMap implementation bases its justification on a Poisson distribution: https://github.com/openjdk/jdk/blob/jdk-17-ga/src/java.base/share/classes/java/util/HashMap.java#L181
It concludes that a bucket with 8 or more collisions occurs with a probability of less than one in ten million, so 8 was chosen as the "treeify threshold".
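As a quick cross-check, the probabilities quoted in that comment can be reproduced from the Poisson probability mass function P(k) = e^(-λ) · λ^k / k! with λ = 0.5. The following is a standalone sketch of mine, not code from the JDK:

```java
// Sketch: reproduce the bucket-size probabilities quoted in the HashMap
// source comment, using the Poisson pmf P(k) = exp(-lambda) * lambda^k / k!.
public class PoissonBucketSizes {

    static double poisson(double lambda, int k) {
        double p = Math.exp(-lambda);      // e^(-lambda)
        for (int i = 1; i <= k; i++) {
            p *= lambda / i;               // multiply in lambda^k / k! term by term
        }
        return p;
    }

    public static void main(String[] args) {
        for (int k = 0; k <= 8; k++) {
            System.out.printf("%d: %.8f%n", k, poisson(0.5, k));
        }
        // k = 8 prints 0.00000006, i.e. less than one in ten million.
    }
}
```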
Where did the parameter of 0.5 come from? In this case, the parameter means the average load (or fullness) of a HashMap, that is, the number of mappings divided by the "capacity" (table length, or number of buckets). The question can thus be restated as: What is the average load of a HashMap? This can't be derived mathematically, because it depends on the way a program uses a HashMap, the program's input, and so forth. But we can make some reasonable assumptions and arrive at some conclusions which probably apply to typical cases.
First, assume most HashMaps use a load factor of 0.75. Anecdotally I believe this to be true, as I've rarely seen cases where code uses an explicit load factor.
Second, assume that the number of mappings in a HashMap is uniformly distributed. I'm almost positive this is not true, but let's start with this as a working assumption and revisit it later.
Third, let's set aside cases where a HashMap contains over a billion or so elements, where things like Java's array size limitations come into play.
Fourth, for simplicity, let's consider only HashMaps that are created without a preallocated capacity and are populated with some unknown number of mappings.
Recall the definition of load factor: if a HashMap's load exceeds the load factor, the HashMap will be resized larger to keep the load below the load factor.
Clearly, a HashMap in normal operation can't have a load of more than 0.75. A HashMap also won't have a load of less than 0.375, because it won't be resized until its load reaches 0.75, and resizing halves the load. So maybe the average load is midway between 0.375 and 0.75, which is 0.5625.
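To make those bounds concrete, here's a toy model (my own sketch, not JDK code) of the default behavior: capacity starts at 16 and doubles whenever the size exceeds capacity × 0.75. It prints the load just before and just after each resize; the load just before a resize is 0.75, and just after a resize it drops to a little above 0.375:

```java
// Toy model of default HashMap table growth: capacity starts at 16 and
// doubles whenever size exceeds capacity * 0.75. Prints the load just
// before and just after each resize.
public class LoadBounds {
    public static void main(String[] args) {
        int capacity = 16;                        // default initial capacity
        for (int size = 1; size <= 1000; size++) {
            if (size > capacity * 0.75) {         // this insertion triggers a resize
                System.out.printf("before resize: %4d / %4d = %.3f%n",
                        size - 1, capacity, (size - 1) / (double) capacity);
                capacity *= 2;                    // resizing doubles the table
                System.out.printf(" after resize: %4d / %4d = %.3f%n",
                        size, capacity, size / (double) capacity);
            }
        }
    }
}
```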
One could do some math to figure this out, but it was easier for me to write a program to populate HashMaps of sizes between 1 and N and compute the average load over all those cases. If N is about 0.75 of a power of two (say, 768) then the average load is indeed quite close to 0.5625. However, this varies a lot depending on one's choice of N. If a chosen N is midway between 0.75 of successive powers of two (say, 576) then the average load is only 0.507. So the average load of HashMaps with up to N mappings varies between about 0.507 and 0.5625, depending on one's choice of N. (This is the "resizing granularity" variance mentioned in the comment.)
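The original program isn't shown, so here is a rough reconstruction under the assumptions above (default initial capacity 16, doubling whenever the size exceeds capacity × 0.75). The exact averages it prints depend on those modeling details, but it shows the same pattern: an average load in the low-to-mid 0.5 range that varies with the choice of N.

```java
// Sketch: average load of a modeled default HashMap over map sizes 1..N.
public class AverageLoad {

    // Table capacity of the modeled HashMap after inserting n mappings.
    static int capacityFor(int n) {
        int capacity = 16;                 // default initial capacity
        while (n > capacity * 0.75) {
            capacity *= 2;                 // resizing doubles the table
        }
        return capacity;
    }

    public static void main(String[] args) {
        // 768 is 0.75 of a power of two; 576 is midway between 384 and 768.
        for (int n : new int[] {768, 576}) {
            double sum = 0;
            for (int size = 1; size <= n; size++) {
                sum += (double) size / capacityFor(size);
            }
            System.out.printf("N = %d, average load = %.4f%n", n, sum / n);
        }
    }
}
```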
Recall that we assumed that HashMap sizes are uniformly distributed. It's probably not true, but what is the actual distribution? I don't know, but I'd guess that there is exponential falloff as the sizes get larger. If so, this would skew the distribution of sizes towards smaller numbers, reducing the average load to be closer to 0.507 than 0.5625. I'd also guess that a lot of people create HashMaps with the default capacity (16) and populate them with six or fewer mappings. That would pull the average down farther.
In summary, our model is telling us that the average load is somewhere between 0.507 and 0.5625. Some educated guesses tell us that sizes are skewed smaller, so the average load is also smaller. The choice of 0.5 therefore seems reasonable.
But we really don't need a precise answer for the analysis of when to treeify HashMap buckets. The point of the analysis is to find a threshold such that conversion is extremely unlikely to occur during normal operation. If 0.4 or 0.7 were used instead of 0.5, the Poisson analysis would still indicate that the chance of having 8 collisions in a single bucket is vanishingly small, on the order of one in a million or less.
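To back that up, here's a variant of the earlier Poisson sketch that evaluates the probability of a bucket of size 8 for a few illustrative parameter values (0.4 and 0.7 are just examples, not values used by the JDK):

```java
// Sketch: probability of a bucket of size 8 under Poisson(lambda),
// for a few illustrative values of the average load.
public class TreeifyRarity {

    static double poisson(double lambda, int k) {
        double p = Math.exp(-lambda);
        for (int i = 1; i <= k; i++) {
            p *= lambda / i;
        }
        return p;
    }

    public static void main(String[] args) {
        for (double lambda : new double[] {0.4, 0.5, 0.7}) {
            // Prints roughly 1e-08, 6e-08, and 7e-07 respectively.
            System.out.printf("lambda = %.1f -> P(8) = %.2e%n", lambda, poisson(lambda, 8));
        }
    }
}
```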