The GrowTable method in .NET's ConcurrentDictionary
Can you explain some of the magic in the GrowTable method?
Code:
// Compute the new table size. We find the smallest integer larger than twice the previous table size, and not divisible by
// 2, 3, 5 or 7. We can consider a different table-sizing policy in the future.
int newLength;
try
{
    checked
    {
        // Double the size of the buckets table and add one, so that we have an odd integer.
        newLength = buckets.Length * 2 + 1;
        // Now, we only need to check odd integers, and find the first that is not divisible
        // by 3, 5 or 7.
        while (newLength % 3 == 0 || newLength % 5 == 0 || newLength % 7 == 0)
        {
            newLength += 2;
        }
        Assert(newLength % 2 != 0);
    }
}
In other .NET collections (List, Dictionary), the resize method simply keeps doubling the size.
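To see what that loop actually produces, here is a minimal, standalone sketch (not the ConcurrentDictionary source) that replays the same sizing rule; the starting length of 31 is just an assumption for illustration:

using System;

class TableSizingSketch
{
    static int NextLength(int currentLength)
    {
        // Double and add one, so the result is odd; then skip odd values divisible by 3, 5 or 7.
        int newLength = checked(currentLength * 2 + 1);
        while (newLength % 3 == 0 || newLength % 5 == 0 || newLength % 7 == 0)
        {
            newLength += 2;
        }
        return newLength;
    }

    static void Main()
    {
        int length = 31; // assumed starting size, for illustration only
        for (int i = 0; i < 5; i++)
        {
            length = NextLength(length);
            Console.Write(length + " "); // prints: 67 137 277 557 1117
        }
    }
}

The resulting lengths are always odd and never divisible by 3, 5 or 7, but they are not guaranteed to be prime; they are the "almost primes" discussed in the answer below.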
1 Answer
When you store data in a growable array, the decision of how much to grow it is basically a space-time tradeoff. If you allocate (and thus waste) a lot of memory whenever you need to grow the array, you get faster execution, because you don't need to resize the array that often. (As far as amortized time complexity is concerned, growing by x% is the same for any x; but the actual speed is different.)
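As a purely illustrative sketch (not taken from any library), you can count how many element copies a growable array performs while appending items under two hypothetical growth factors:

using System;

class GrowthFactorSketch
{
    static (long copies, int capacity) Simulate(int items, double factor)
    {
        long copies = 0;
        int capacity = 4, count = 0;
        for (int i = 0; i < items; i++)
        {
            if (count == capacity)
            {
                copies += count; // all existing items get copied on each resize
                capacity = Math.Max(capacity + 1, (int)(capacity * factor));
            }
            count++;
        }
        return (copies, capacity);
    }

    static void Main()
    {
        foreach (double factor in new[] { 2.0, 1.5 })
        {
            var (copies, capacity) = Simulate(1_000_000, factor);
            Console.WriteLine($"factor {factor}: {copies:N0} copies, final capacity {capacity:N0}");
        }
    }
}

Both factors are O(1) amortized per append, but the larger factor does fewer copies at the cost of more unused capacity.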
How much the array grows usually isn't documented (only the time complexity of addition is). For example, the current MS implementation of .NET List<T> grows to twice the size, but the current MS implementation of the C++ vector<T> grows only by 50%.
All of this could apply to the entries collection in a hash-table-based dictionary, if you implement it in a single array (which certainly isn't the only possibility). But you also have to consider the buckets. They too have a space-time tradeoff, but a different one: the dictionary works whichever size you choose, but the bigger the buckets array, the fewer collisions you get. And with fewer collisions, you get faster lookups. If you want constant-time lookup, the number of buckets has to be linearly proportional to the number of items in the dictionary. So it makes sense to make the buckets array the same size as the entries array.
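As a rough illustration of that proportionality (random keys and made-up sizes, not the real Dictionary internals), the expected number of entries examined per lookup tracks the load factor items / buckets:

using System;
using System.Linq;

class LoadFactorSketch
{
    static double AverageEntriesPerLookup(int items, int bucketCount)
    {
        var rng = new Random(42);
        var chainLengths = new int[bucketCount];
        for (int i = 0; i < items; i++)
        {
            chainLengths[rng.Next(bucketCount)]++; // hash a random key into a bucket
        }
        // Expected chain length of the bucket that a randomly chosen item lives in.
        return chainLengths.Sum(c => (long)c * c) / (double)items;
    }

    static void Main()
    {
        foreach (int buckets in new[] { 1_000, 10_000, 100_000 })
        {
            Console.WriteLine($"{buckets,7} buckets: ~{AverageEntriesPerLookup(100_000, buckets):F1} entries scanned per lookup");
        }
    }
}

Only when the bucket count grows along with the item count does that number stay roughly constant.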
But there's one more thing. If the hash codes have some structure, you don't want this structure to be reflected in your dictionary, because that causes more collisions. Let's say you used account ids as your hash keys, and you assign ids to users in the IT department starting from 0, to users in marketing starting from 200, and so on. Now, if the size of the buckets array were 100, you would get many collisions and thus terrible performance. If the size of the array is divisible by 2 or 5, the number of collisions goes up too.
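The department-id scenario is hard to compress into a few lines, but the same effect shows up with any keys sharing a common factor with the bucket count; here is a purely illustrative sketch with keys that are all multiples of 4:

using System;
using System.Linq;

class SharedFactorSketch
{
    static int BucketsUsed(int bucketCount)
    {
        var keys = Enumerable.Range(0, 1000).Select(i => i * 4); // strided keys, e.g. ids handed out in steps of 4
        return keys.Select(k => k % bucketCount).Distinct().Count();
    }

    static void Main()
    {
        Console.WriteLine($"64 buckets (shares factor 4): {BucketsUsed(64)} buckets actually used"); // 16
        Console.WriteLine($"67 buckets (prime):           {BucketsUsed(67)} buckets actually used"); // 67
    }
}

With 64 buckets, three quarters of them stay empty and every used bucket is roughly four times as crowded; with a prime bucket count the same keys spread over every bucket.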
Because of this, the best size for the buckets array is a prime. The next best size is an almost-prime (a number that doesn't have many divisors). And as before, there is a tradeoff: computing primes is relatively slow. To make it faster, you can say that almost-primes are good enough, or you can precompute some of the primes, which costs some memory.
With all those tradeoffs, there is no single best solution. For your own usage, you could try fine-tuning all those parameters and find out which combination works best for you. But library writers have to make them good enough for everyone, or at least almost everyone. And for some reason, the authors of Dictionary<T> chose slightly differently than the authors of ConcurrentDictionary<T>.
The specific algorithm that the current MS implementation of Dictionary<T> uses is to have a table of some primes up to 7,199,369. It always uses a prime bigger than twice the current size. For small numbers, it chooses the smallest such prime from the table; for larger numbers, it computes the smallest such prime exactly.
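A rough sketch of that kind of policy (this is not the actual HashHelpers source, and the tiny prime table here is only illustrative; the real table extends up to 7,199,369):

using System;
using System.Linq;

class PrimeSizingSketch
{
    static readonly int[] Primes = { 3, 7, 11, 17, 23, 29, 37, 47, 59, 71 };

    static bool IsPrime(int candidate)
    {
        if (candidate < 2) return false;
        for (int d = 2; d * d <= candidate; d++)
        {
            if (candidate % d == 0) return false;
        }
        return true;
    }

    static int NextTableSize(int currentSize)
    {
        int min = checked(currentSize * 2);
        // Small sizes: look up the precomputed table.
        foreach (int p in Primes)
        {
            if (p > min) return p;
        }
        // Large sizes: compute the next prime directly by trial division.
        for (int candidate = min | 1; ; candidate += 2)
        {
            if (IsPrime(candidate)) return candidate;
        }
    }

    static void Main()
    {
        Console.WriteLine(string.Join(" ", new[] { 3, 10, 100 }.Select(NextTableSize))); // prints: 7 23 211
    }
}

ConcurrentDictionary's loop above instead settles for "odd and not divisible by 3, 5 or 7", which avoids any prime computation at the cost of a slightly less uniform bucket distribution.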