I'm working on a hash table in C and I'm testing hash functions for strings.
The first function I tried adds up the ASCII codes of the characters and takes the result modulo 100 (% 100), but I got poor results with the first test data: 40 collisions for 130 words.
The final input data will contain 8000 words (it's a dictionary stored in a file). The hash table is declared as int table[10000] and contains the position of each word in a .txt file.
- Which is the best algorithm for hashing strings?
- And how do I determine the size of the hash table?
I've had nice results with djb2 by Dan Bernstein.
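For reference, a minimal sketch of djb2 in C. The classic version uses unsigned long; the fixed 32-bit return type and the function name here are assumptions made for illustration.

```c
#include <stdint.h>

/* djb2: start from 5381 and fold in each byte as hash * 33 + byte. */
uint32_t djb2(const char *str) {
    uint32_t hash = 5381u;
    for (const unsigned char *p = (const unsigned char *)str; *p != '\0'; p++) {
        /* Often written as ((hash << 5) + hash) + c, which is the same thing. */
        hash = hash * 33u + *p;
    }
    return hash;
}
```

To index a table, reduce the returned value to the table size afterwards, for example with djb2(word) % table_size.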
First, you generally do not want to use a cryptographic hash for a hash table. An algorithm that's very fast by cryptographic standards is still excruciatingly slow by hash table standards.
Second, you want to ensure that every bit of the input can/will affect the result. One easy way to do that is to rotate the current result by some number of bits, then XOR the current hash code with the current byte. Repeat until you reach the end of the string. Note that you generally do not want the rotation to be an even multiple of the byte size either.
For example, assuming the common case of 8 bit bytes, you might rotate by 5 bits:
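A minimal sketch of that rotate-and-XOR scheme, assuming a 32-bit hash value and a NUL-terminated string. The names rotl5 and rot_hash are hypothetical, chosen for this example.

```c
#include <stdint.h>

/* Rotate a 32-bit value left by 5 bits. */
static uint32_t rotl5(uint32_t x) {
    return (x << 5) | (x >> 27);
}

/* Rotate the running hash, then XOR in the next byte. */
uint32_t rot_hash(const char *str) {
    uint32_t hash = 0;
    for (const unsigned char *p = (const unsigned char *)str; *p != '\0'; p++) {
        hash = rotl5(hash) ^ *p;
    }
    return hash;
}
```

Because 5 is not a multiple of 8, successive bytes land on different bit positions, so "ab" and "ba" hash to different values.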
Edit: Also note that 10000 slots is rarely a good choice for a hash table size. You usually want one of two things: you either want a prime number as the size (required to ensure correctness with some types of hash resolution) or else a power of 2 (so reducing the value to the correct range can be done with a simple bit-mask).
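The two sizing options can be sketched as follows. The function names are hypothetical, and 10007 (the smallest prime above 10000) and 16384 are only example sizes.

```c
#include <stddef.h>
#include <stdint.h>

/* Table size is a power of two: keep the low bits with a mask.
   Only valid when size_pow2 really is a power of two. */
size_t index_pow2(uint32_t hash, size_t size_pow2) {
    return hash & (size_pow2 - 1);
}

/* Table size is a prime (or anything else): take the remainder. */
size_t index_mod(uint32_t hash, size_t size) {
    return hash % size;
}
```

For an 8000-word dictionary, index_pow2(h, 16384) or index_mod(h, 10007) would both give a valid slot. The mask version only looks at the low bits of the hash, so it relies more heavily on the hash function mixing well.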
I wanted to verify Xiaoning Bian's answer, but unfortunately he didn't post his code. So I implemented a small test suite and ran several small hash functions on a list of 466K English words to see the number of collisions for each.
I included timings for both hashing all words individually and hashing the entire file of English words once. I also included the more complex MurmurHash3_x86_32 in my test for reference. Additionally, I calculated "avalanching", a measure of how unpredictable the hashes of consecutive words are. Anything below 100% means "quite predictable" (which might be fine for hash tables, but is bad for other uses).
Conclusions:
- Examples of word pairs that collide under DJB2: Liz/MHz, Bon/COM, Rey/SEX.
- hash = 17000069 * hash + s[i] produces only 25 collisions compared to DJB2's 344 collisions.
The test code was compiled with gcc -O2.
P.S. A more comprehensive review of the speed and quality of modern hash functions can be found in the SMHasher repository of Reini Urban (rurban). Notice the "Quality problems" column in the table.
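A minimal sketch of the multiplicative hash quoted above. The starting value of 0, the 32-bit width, and the function name are assumptions for illustration, since only the update step is given.

```c
#include <stdint.h>

/* Multiplicative string hash: hash = 17000069 * hash + byte.
   The initial value 0 is an assumption; arithmetic wraps modulo 2^32. */
uint32_t mult_hash(const char *s) {
    uint32_t hash = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        hash = 17000069u * hash + *p;
    }
    return hash;
}
```

It has the same shape as DJB2 and differs only in the multiplier and the starting value, which makes it easy to swap in and compare on your own word list.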
Wikipedia shows a nice string hash function called Jenkins One At A Time Hash. It also quotes improved versions of this hash.
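For convenience, a sketch of the one-at-a-time hash in C, following the widely published form of the algorithm. It takes an explicit length, so it also works on data containing NUL bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Jenkins one-at-a-time: mix each byte into the hash, then run a final
   avalanche step so that the last bytes also affect every output bit. */
uint32_t jenkins_one_at_a_time_hash(const uint8_t *key, size_t length) {
    uint32_t hash = 0;
    for (size_t i = 0; i < length; i++) {
        hash += key[i];
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}
```

For a C string, call it as jenkins_one_at_a_time_hash((const uint8_t *)word, strlen(word)).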
djb2 is good
Though djb2, as presented on Stack Overflow by cnicutar, is almost certainly better, I think it's worth showing the K&R hashes too.
One of the K&R hashes is terrible, one is probably pretty good. For the 2nd edition hash, remove % HASHSIZE from the return statement if you plan on doing the modulus sizing-to-your-array-length outside the hash algorithm. Also, I recommend you make the return and "hashval" type unsigned long, or even better uint32_t or uint64_t, instead of the simple unsigned (int). It is a simple algorithm which takes the order of the bytes in the string into account by computing hashvalue = new_byte + 31*hashvalue for all bytes in the string.
Note that it's clear from the two algorithms that one reason the 1st edition hash is so terrible is that it does NOT take string character order into consideration, so hash("ab") returns the same value as hash("ba"). This is not so with the 2nd edition hash, however, which (much better!) returns two different values for those strings.
The GCC C++11 hashing function used by the std::unordered_map<> template container hash table is excellent.
It appears to be the hash used for both unordered_map (a hash table template) and unordered_set (a hash set template).
MurmurHash3 by Austin Appleby is best! It's an improvement over even his gcc C++11 std::unordered_map<> hash used above.
Not only is it the best of all of these, but Austin released MurmurHash3 into the public domain. See my other answer on this here: What is the default hash function used in C++ std::unordered_map?.
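The two K&R-style hashes discussed above can be sketched as follows. This follows the description in the text (a plain byte sum versus new_byte + 31*hashvalue), with uint32_t as the hash type and % HASHSIZE left out, as recommended. The function names are hypothetical.

```c
#include <stdint.h>

/* 1st edition style: plain sum of all bytes, so character order is ignored. */
uint32_t kr1_hash(const char *s) {
    uint32_t hashval = 0;
    for (; *s != '\0'; s++) {
        hashval += (unsigned char)*s;
    }
    return hashval;
}

/* 2nd edition style: hashval = byte + 31 * hashval, so order matters. */
uint32_t kr2_hash(const char *s) {
    uint32_t hashval = 0;
    for (; *s != '\0'; s++) {
        hashval = (unsigned char)*s + 31u * hashval;
    }
    return hashval;
}
```

With these definitions kr1_hash gives 195 for both "ab" and "ba", while kr2_hash gives 3105 and 3135 respectively.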
djb2 has 317 collisions for this 466k English dictionary, while MurmurHash has none for 64-bit hashes and 21 for 32-bit hashes (around 25 is to be expected for 466k random 32-bit hashes).
My recommendation is to use MurmurHash if available. It is very fast because it takes in several bytes at a time. But if you need a simple and short hash function to copy and paste into your project, I'd recommend using Murmur's one-byte-at-a-time version:
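A minimal sketch of a Murmur-style one-byte-at-a-time hash. The function name and the seed value are assumptions chosen for illustration; the multiplier is the mixing constant used by MurmurHash2.

```c
#include <stdint.h>

/* Murmur-style one-at-a-time hash: XOR in a byte, multiply, then fold
   the high bits back down. The seed is an arbitrary illustrative constant. */
uint32_t murmur_oaat32(const char *key) {
    uint32_t h = 3323198485u;
    for (; *key != '\0'; ++key) {
        h ^= (unsigned char)*key;
        h *= 0x5bd1e995u;
        h ^= h >> 15;
    }
    return h;
}
```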
The optimal size of a hash table is, in short, as large as possible while still fitting into memory. Because we don't usually know or want to look up how much memory we have available, and it might even change, a good hash table size is roughly 2x the expected number of elements to be stored in the table. Allocating much more than that will make your hash table faster, but with rapidly diminishing returns; making it smaller than that will make it exponentially slower. This is because there is a non-linear trade-off between space and time complexity for hash tables, with an optimal load factor of 2-sqrt(2) = 0.58... apparently.
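As a sketch of the "roughly 2x" rule, the helper below picks the smallest power of two that is at least twice the expected element count. Rounding up to a power of two is an assumption made here so the index can be taken with a bit mask; the function name is hypothetical.

```c
#include <stddef.h>

/* Smallest power of two >= 2 * expected_items.
   Does not guard against overflow for extremely large counts. */
size_t table_capacity(size_t expected_items) {
    size_t capacity = 1;
    while (capacity < 2 * expected_items) {
        capacity *= 2;
    }
    return capacity;
}
```

For 8000 words this gives 16384 slots, a load factor of about 0.49, which sits a little below the 0.58 figure quoted above.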
There are a number of existing hashtable implementations for C, from the C standard library hcreate/hdestroy/hsearch, to those in the APR and glib, which also provide prebuilt hash functions. I'd highly recommend using those rather than inventing your own hashtable or hash function; they've been optimized heavily for common use-cases.
If your dataset is static, however, your best solution is probably to use a perfect hash. gperf will generate a perfect hash for you for a given dataset.
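A small sketch of the hcreate/hsearch/hdestroy route, assuming a POSIX system where these are declared in <search.h>. The wrappers dict_put and dict_get are hypothetical names. They store a word's file position in the entry's data pointer.

```c
#include <search.h>
#include <stddef.h>
#include <stdint.h>

/* Insert a word with its file position. Returns 0 on success, -1 if the
   table is full. hsearch does not copy the key, so the string must stay
   valid for as long as the table exists. */
int dict_put(char *word, long position) {
    ENTRY item;
    item.key = word;
    item.data = (void *)(intptr_t)position;
    return hsearch(item, ENTER) != NULL ? 0 : -1;
}

/* Look up a word. Returns its stored position, or -1 if it is absent. */
long dict_get(char *word) {
    ENTRY query;
    query.key = word;
    query.data = NULL;
    ENTRY *found = hsearch(query, FIND);
    return found != NULL ? (long)(intptr_t)found->data : -1;
}
```

Call hcreate(n) once before the first insert, with n at least the expected number of words, and hdestroy() when done. Note that there is only one global table with this interface, and that ENTER on a key that already exists returns the existing entry without changing its data.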
First, is 40 collisions for 130 words hashed to 0..99 bad? You can't expect perfect hashing if you are not taking steps specifically for it to happen. An ordinary hash function won't have fewer collisions than a random generator most of the time.
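To put a number on that, the sketch below estimates how many collisions a uniformly random hash would produce. Here a "collision" is counted as a key landing in a slot that is already occupied, which is one common definition and an assumption about how the 40 was counted. The function name is hypothetical.

```c
/* Expected collisions when n keys are placed uniformly at random into m
   slots: n minus the expected number of occupied slots, where a slot stays
   empty with probability (1 - 1/m)^n. */
double expected_collisions(unsigned n, unsigned m) {
    double p_empty = 1.0;
    for (unsigned i = 0; i < n; i++) {
        p_empty *= 1.0 - 1.0 / m;
    }
    return n - m * (1.0 - p_empty);
}
```

For n = 130 and m = 100 this comes out at roughly 57, so under this definition 40 collisions is no worse than what a random function would give.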
A hash function with a good reputation is MurmurHash3.
Finally, regarding the size of the hash table, it really depends what kind of hash table you have in mind, especially, whether buckets are extensible or one-slot. If buckets are extensible, again there is a choice: you choose the average bucket length for the memory/speed constraints that you have.
I have tried these hash functions and got the following result. I have about 960^3 entries, each 64 bytes long (64 chars in different order), with 32-bit hash values. The code is from here.
One strange thing is that almost all the hash functions have a 6% collision rate for my data.
One thing I've used with good results is the following (I don't know if it's been mentioned already, because I can't remember its name).
You precompute a table T with a random number for each character in your key's alphabet [0,255]. You hash your key 'k0 k1 k2 ... kN' by taking T[k0] xor T[k1] xor ... xor T[kN]. You can easily show that this is as random as your random number generator, and it's computationally very feasible. If you really run into a very bad instance with lots of collisions, you can just repeat the whole thing using a fresh batch of random numbers.
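A minimal sketch of the table scheme as described, with a small xorshift32 generator standing in for the random number source (an assumption; any generator would do). All names are hypothetical.

```c
#include <stdint.h>

static uint32_t table_T[256];

/* xorshift32 PRNG, used only to fill the table reproducibly. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fill T with one pseudo-random 32-bit value per byte value. */
void xor_table_init(uint32_t seed) {
    uint32_t state = seed != 0 ? seed : 1u;  /* xorshift needs a nonzero state */
    for (int i = 0; i < 256; i++) {
        table_T[i] = xorshift32(&state);
    }
}

/* Hash = T[k0] xor T[k1] xor ... xor T[kN]. */
uint32_t xor_table_hash(const char *key) {
    uint32_t h = 0;
    for (const unsigned char *p = (const unsigned char *)key; *p != '\0'; p++) {
        h ^= table_T[*p];
    }
    return h;
}
```

One caveat with a single table: XOR is commutative, so the result does not depend on character order, and anagrams such as "ab" and "ba" always collide. A common remedy, known as tabulation hashing, is to use a separate table per character position.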
I want to summarize it all for newbies to C like me. According to Andriy Makukha's thorough testing, MurmurHash3 is the best. A decent C port can be found in murmurhash.c.