What's the best hashing algorithm to use on a stl string when using hash_map?
I've found the standard hashing function on VS2005 is painfully slow when trying to achieve high performance look ups. What are some good examples of fast and efficient hashing algorithms that should avoid most collisions?
11 Answers
I worked with Paul Larson of Microsoft Research on some hashtable implementations. He investigated a number of string hashing functions on a variety of datasets and found that a simple multiply by 101 and add loop worked surprisingly well.
From some old code of mine:
It's fast. Really freaking fast.
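The code itself did not survive extraction; a minimal reconstruction of the multiply-by-101-and-add loop described above (the function name and the `unsigned` accumulator are my assumptions):

```cpp
// Larson-style string hash: multiply the accumulator by 101 and add
// each character, letting the unsigned value wrap on overflow.
unsigned larson_hash(const char* s) {
    unsigned h = 0;
    while (*s)
        h = h * 101 + static_cast<unsigned char>(*s++);
    return h;
}
```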
That always depends on your data set.
I, for one, had surprisingly good results by using the CRC32 of the string. It works very well with a wide range of different input sets.
Lots of good CRC32 implementations are easy to find on the net.
Edit: Almost forgot: This page has a nice hash-function shootout with performance numbers and test-data:
http://smallcode.weblogs.us/ <-- further down the page.
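For illustration, a self-contained bitwise CRC-32 over a string (zlib/IEEE polynomial 0xEDB88320; the function name is mine, and a real lookup path would use a table-driven or library implementation such as zlib's crc32 for speed):

```cpp
#include <cstdint>
#include <string>

// Bit-at-a-time CRC-32, reflected form, polynomial 0xEDB88320.
std::uint32_t crc32_str(const std::string& s) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (unsigned char c : s) {
        crc ^= c;
        for (int i = 0; i < 8; ++i)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
    }
    return ~crc;
}
```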
Boost has a boost::hash library which provides some basic hash functions for the most common types.
I've used the Jenkins hash to write a Bloom filter library; it has great performance.
Details and code are available here: http://burtleburtle.net/bob/c/lookup3.c
This is what Perl uses for its hashing operation, fwiw.
If you are hashing a fixed set of words, the best hash function is often a perfect hash function. However, they generally require that the set of words you are trying to hash is known at compile time. Detection of keywords in a lexer (and translation of keywords to tokens) is a common use of perfect hash functions generated with tools such as gperf. A perfect hash also lets you replace hash_map with a simple array or vector. If you're not hashing a fixed set of words, then obviously this doesn't apply.
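As a sketch, a gperf input file for a hypothetical keyword set might look like this (the keyword list is invented; gperf generates a C lookup function from it):

```
%%
if
else
while
return
break
%%
```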
One classic suggestion for a string hash is to step through the letters one by one, adding their ASCII/Unicode values to an accumulator and multiplying the accumulator by a prime number each time (allowing the hash value to overflow).
It's hard to get faster than that without throwing out data. If you know your strings can be differentiated by only a few characters rather than their whole content, you can go faster.
If the value gets calculated too often, you might try caching the hash value by creating a new subclass of basic_string that remembers its hash value. hash_map should be doing that internally, though.
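A minimal sketch of that caching idea, using a wrapper struct rather than a basic_string subclass (all names and the mixing prime are my assumptions):

```cpp
#include <cstddef>
#include <string>

// Computes the hash once at construction so repeated lookups
// don't have to re-scan the characters.
struct CachedHashString {
    std::string value;
    std::size_t hash;

    explicit CachedHashString(const std::string& v)
        : value(v), hash(compute(v)) {}

    // The multiply-by-prime accumulator loop described above.
    static std::size_t compute(const std::string& s) {
        std::size_t h = 0;
        for (unsigned char c : s)
            h = h * 31 + c;
        return h;
    }
};
```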
I did a little searching, and funny thing, Paul Larson's little algorithm showed up here
http://www.strchr.com/hash_functions
as having the least collisions of any hash tested under a number of conditions, and it's very fast for one that isn't unrolled or table-driven.
Larson's is the simple multiply-by-101-and-add loop above.
Python 3.4 includes a new hash algorithm based on SipHash. PEP 456 is very informative.
From Hash Functions all the way down:
Without more information I would recommend MurmurHash as a general purpose non-cryptographic hash function. For small strings (of the size of the average identifier in programs) the very simple and famous djb2 and FNV are very good.
djb2
FNV-1
FNV-1A
A note about security and availability
Hash functions can make your code vulnerable to denial-of-service attacks. If an attacker is able to force your server to handle too many collisions, your server may not be able to cope with requests.
Some hash functions like MurmurHash accept a seed that you can provide to drastically reduce the ability of attackers to predict the hashes your server software is generating. Keep that in mind.
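For reference, the two simple hashes named above, as self-contained sketches (32-bit variants; the function names are mine):

```cpp
#include <cstdint>
#include <string>

// djb2 (Bernstein): h = h * 33 + c, seeded with 5381.
std::uint32_t djb2(const std::string& s) {
    std::uint32_t h = 5381;
    for (unsigned char c : s)
        h = h * 33 + c;
    return h;
}

// FNV-1a (32-bit): XOR in each byte, then multiply by the FNV prime.
std::uint32_t fnv1a(const std::string& s) {
    std::uint32_t h = 2166136261u;   // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 16777619u;              // FNV prime
    }
    return h;
}
```

Note that neither takes a seed, which is exactly the denial-of-service concern raised above; seeded functions like MurmurHash or SipHash address that.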
If your strings are on average longer than a single cache line, but their length+prefix are reasonably unique, consider hashing just the length plus the first 8/16 characters. (The length is contained in the std::string object itself and is therefore cheap to read.)
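A sketch of that idea (the function name and mixing constant are mine; strings sharing both length and prefix deliberately collide under this scheme):

```cpp
#include <cstddef>
#include <string>

// Hash only the length plus the first 8 characters, assuming
// length + prefix is nearly unique across the input strings.
std::size_t length_prefix_hash(const std::string& s) {
    std::size_t h = s.size();  // stored in std::string, cheap to read
    std::size_t n = s.size() < 8 ? s.size() : 8;
    for (std::size_t i = 0; i < n; ++i)
        h = h * 101 + static_cast<unsigned char>(s[i]);
    return h;
}
```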