How can I check whether my custom hash for hash_map is good?
I've written a custom hashing for my custom key in stdext::hash_map and would like to check whether the hasher is good. I'm using STL supplied with VS 2008. A typical check, as I know, is to check the uniformity of distribution among buckets.
How should I organize such a check correctly? A solution that comes to my mind is to modify the STL sources to add a method to hash_map that walks through the buckets and does the job. Are there any better ways?
Maybe derive from hash_map and add such a method there?
2 Answers
Your best bet might be to just run your hashing algorithm over an array of ints and count the number of times each hash bucket is hit, given real-world data. (I'm suggesting taking the STL out of the equation here, really.)
If you end up seeing high deviation in your counts with large sets of real-world data, your hashing algorithm is generating lots of collisions when there are plenty of empty (or emptier) buckets available.
Note that 'high deviation' is a relative term. A good hash algorithm is a deterministic random process and any random process has a chance of generating strange results, so test often, test well, and wherever possible, use your actual problem domain as a source of your tests and your controls.
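The "array of ints, no STL" approach above could be sketched as follows. `myHash` here is a hypothetical stand-in for your custom hasher (an FNV-1a-style mix over strings, purely for illustration); you would substitute your own hash function and your own real-world keys:

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for your custom hasher -- replace with your own.
std::size_t myHash(const std::string& key) {
    std::size_t h = 2166136261u;                      // FNV-1a style mixing
    for (std::size_t i = 0; i < key.size(); ++i) {
        h ^= static_cast<unsigned char>(key[i]);
        h *= 16777619u;
    }
    return h;
}

// Count how often each bucket is hit, with the STL container out of
// the equation: just hash, take the modulus, and bump a counter.
std::vector<std::size_t> bucketHits(const std::vector<std::string>& keys,
                                    std::size_t bucketCount) {
    std::vector<std::size_t> hits(bucketCount, 0);
    for (std::size_t i = 0; i < keys.size(); ++i)
        ++hits[myHash(keys[i]) % bucketCount];
    return hits;
}

// Standard deviation of the per-bucket counts; lower means more uniform.
double deviation(const std::vector<std::size_t>& hits, std::size_t total) {
    double mean = static_cast<double>(total) / hits.size();
    double sum = 0.0;
    for (std::size_t i = 0; i < hits.size(); ++i) {
        double d = static_cast<double>(hits[i]) - mean;
        sum += d * d;
    }
    return std::sqrt(sum / hits.size());
}
```

Feeding `bucketHits` a large sample of keys drawn from your actual problem domain, and comparing the resulting deviation across candidate hash functions, gives exactly the "high deviation" signal described above without touching hash_map at all.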
I'd run one (large) dataset through stl::hash_map. Once done, I'd collect the element count of every bucket from the hash_map. Finally, I would compute the standard deviation (SD) of the element-to-bucket distribution.
I'd do the above for different hash functions. Whichever hash function results in minimum SD is the winner (for this dataset).
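A sketch of this per-bucket SD measurement. Note that VS 2008's `stdext::hash_map` does not expose its buckets publicly, so this example assumes a compiler with `std::unordered_map`, whose standard bucket interface (`bucket_count()`, `bucket_size(i)`) provides exactly the counts the answer describes:

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_map>

// Standard deviation of the element-to-bucket distribution of a populated
// map; when comparing hash functions on the same dataset, lower SD wins.
double bucketStdDev(const std::unordered_map<std::string, int>& m) {
    std::size_t buckets = m.bucket_count();
    double mean = static_cast<double>(m.size()) / buckets;
    double sum = 0.0;
    for (std::size_t i = 0; i < buckets; ++i) {
        double d = static_cast<double>(m.bucket_size(i)) - mean;
        sum += d * d;
    }
    return std::sqrt(sum / buckets);
}
```

To compare hash functions, instantiate the map with each candidate hasher as the template argument, insert the same dataset, and keep whichever gives the smallest `bucketStdDev`.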