Best hash function for mixed numeric and text identifiers

Published 2024-08-15 08:08:57


For performance reasons I have a need to split a set of objects identified by a string into groups. Objects may be either identified by a number or by a string in prefixed (qualified) form with dots separating parts of the identifier:

12
323
12343
2345233
123123131
ns1:my.label.one
ns1:my.label.two
ns1:my.label.three
ns1:system.text.one
ns2:edit.box.grey
ns2:edit.box.black
ns2:edit.box.mixed

Numeric identifiers range from 1 to several million. Many of the text identifiers are likely to start with the same namespace prefix (ns1:) and the same path prefix (edit.box.).

What is the best hash function for this purpose? It would be good if I could somehow predict the bucket sizes from object-identifier statistics. Are there any good articles on constructing a good hash function from such statistical information?

There are several million such identifiers, and the goal is to split them into 1-2 thousand groups based on the hash function.


Comments (3)

唔猫 2024-08-22 08:08:57


Two good hash functions can both be mapped into the same space of values, and will in general not cause any new problems as a result of combining them.

So your hash function can look like this:

if it's an integer value:
    return int_hash(integer value)
return string_hash(string value)

Unless your integers clump around certain values modulo N, where N is a possible number of buckets, int_hash can just return its input.

Picking a string hash is not a novel problem. Try "djb2" (http://www.cse.yorku.ca/~oz/hash.html) or similar, unless you have obscene performance requirements.
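As a sketch, the dispatch plus djb2 might look like the following in Python (with the assumption, not stated in the question, that every identifier arrives as a string and purely numeric ones are treated as integers):

```python
def djb2(s: str) -> int:
    """Bernstein's djb2 string hash (http://www.cse.yorku.ca/~oz/hash.html)."""
    h = 5381
    for c in s:
        h = (h * 33 + ord(c)) & 0xFFFFFFFF  # keep the running hash in 32 bits
    return h

def bucket(identifier: str, n_buckets: int = 2000) -> int:
    """Combined hash: identity for numeric ids, djb2 for text ids."""
    if identifier.isdigit():           # numeric id: int_hash is the identity
        return int(identifier) % n_buckets
    return djb2(identifier) % n_buckets
```

For example, `bucket("12")` lands in bucket 12, while `bucket("ns1:my.label.one")` is spread by djb2 before the modulo reduction.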

I don't think there's much point in modifying the hash function to take account of the common prefixes. If your hash function is good to start with, then it is unlikely that common prefixes will create any clumping of hash values.

If you do this, and the hash doesn't unexpectedly perform badly, and you put your several million hash values into a few thousand buckets, then each bucket's population will be approximately binomially distributed: every key lands in a given bucket independently with probability 1/k, so the mean is n/k (several million / a few thousand) and the variance is close to that mean.

With an average of 1500 entries per bucket, that makes the standard deviation √1500, somewhere around 39. 95% of a normal distribution lies within 2 standard deviations of the mean, so 95% of your buckets will contain roughly 1420-1580 entries, unless I've done my sums wrong. Is that adequate, or do you need the buckets to be of more closely similar sizes?
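The expected spread can be checked empirically with a quick simulation, using Python's random module to stand in for an ideal hash (scaled down here to 300,000 keys in 200 buckets so the mean stays at 1500 entries per bucket):

```python
import random
from collections import Counter

random.seed(42)
n_keys, n_buckets = 300_000, 200

# Assign each key to a bucket uniformly at random, as an ideal hash would.
counts = Counter(random.randrange(n_buckets) for _ in range(n_keys))

mean = n_keys / n_buckets                 # 1500 entries per bucket
sd = (mean * (1 - 1 / n_buckets)) ** 0.5  # binomial std dev, about 38.6
print(f"mean={mean:.0f} sd={sd:.1f} "
      f"min={min(counts.values())} max={max(counts.values())}")
```

The observed minimum and maximum bucket sizes stay within a few standard deviations of the mean, which is the clustering a well-behaved hash should give.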

寻梦旅人 2024-08-22 08:08:57


You would probably be safe going with sha1 and truncating it to whatever size you want.

It wouldn't be extremely efficient, but perhaps the hash function won't be a bottleneck?
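A minimal sketch of this suggestion, assuming Python's hashlib and truncation to the first two digest bytes (16 bits) before the modulo reduction:

```python
import hashlib

def sha1_bucket(identifier: str, n_buckets: int = 2000) -> int:
    """Hash the identifier with SHA-1, keep 16 bits, reduce to a bucket."""
    digest = hashlib.sha1(identifier.encode("utf-8")).digest()
    truncated = int.from_bytes(digest[:2], "big")  # first 2 bytes = 16 bits
    return truncated % n_buckets
```

Any slice of the digest works equally well, since every bit of a cryptographic hash is uniformly distributed.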

一梦浮鱼 2024-08-22 08:08:57

我认为 CRC16 对于这些字符串来说是一个合理的哈希值,并且组数不应超过 1-2000。

这应该使哈希表大约为 1MB + 无论其中有多少项 * 4 字节,所以我们谈论的是 50MB,然后您还存储了所有实际数据,这些数据最好非常小。

I reckon CRC16 would be a reasonable hash to use on these strings, and the groups shouldn't go any bigger than 1-2 thousand.

This should make the hash table about 1MB + however many items you have in it * 4 bytes, so we're talking 50MB, and then you also have all the actual data being stored, which had better be very small.
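One possible sketch of the CRC16 idea, using the standard library's binascii.crc_hqx (CRC-16/XMODEM) as the 16-bit CRC; any 16-bit CRC would do for grouping:

```python
import binascii

def crc16_bucket(identifier: str, n_buckets: int = 2000) -> int:
    """CRC-16 of the identifier bytes, reduced to one of n_buckets groups."""
    crc = binascii.crc_hqx(identifier.encode("utf-8"), 0)  # 16-bit value
    return crc % n_buckets
```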
