Hashtable lookup - with a perfect hash, in C
I have a C-language app where I need to do table lookups.

The entries are strings, all known at the start of runtime. The table is initialized once and then looked up many times. The table can change, but when it does it's basically as if the app starts over, so I think this means I can use a perfect hash? It's OK to spend some time on hashtable initialization, since it happens just once.

There will be between 3 and 100,000 entries, each one unique, and I estimate that 80% of cases will have fewer than 100 entries. A simple naive lookup is "fast enough" in those cases (== no one is complaining).

However, in the cases with 10k+ entries, the lookup speed of a naive approach is unacceptable. What's a good approach for delivering good hashtable-based lookup performance for strings in C?

Assume I do not have a 3rd-party library like Boost/etc. What hash algorithm should I use? How do I decide?
Comments (3)
Generating a perfect hash is not a simple problem. There are libraries devoted to the task.

In this case the most popular one is probably CMPH. I haven't used it, though, so I can't help beyond that. gperf is another tool, but it requires the strings to be known at compile time (you could work around that by compiling a .so and loading it, but that's kind of overkill).

But frankly, I'd at least try a binary search first. Simply sort the array using qsort, then search with bsearch (or roll your own). Both are part of stdlib.h since C89.
If a naive (I assume you mean linear) approach is OK for 100 entries (so 50 comparisons are done on average), then a binary search will be more than sufficient for 100,000 entries (it takes at most 17 comparisons).

So I wouldn't bother with hashes at all, but just sort your string table on startup (e.g. using qsort) and later use a binary search (e.g. using bsearch) to look up entries.
If the (maximal) table size is known, a plain hashtable with chaining is very easy to implement. The size overhead is only two ints per item. With a reasonable hash function, only 1.5 probes per lookup are needed on average, even for a 100%-loaded table.

Constructing a perfect hash is only feasible if your data does not change. Once it changes, you'll have to recompute and rehash, which is far more expensive than doing a few extra comparisons.