制作一个完美的哈希(所有连续的桶都已满)、gperf 还是替代方案?

发布于 2024-08-12 10:00:20 字数 610 浏览 9 评论 0 原文

假设我想构建一个完美的哈希表来查找预定义键为 12 个月的数组,因此我希望

hash("January")==0
hash("December")==11

通过 gperf 并得到了一个很好的哈希函数,但它似乎给出了 16 个桶(或者更确切地说,范围是 16)!

#define MIN_HASH_VALUE 3
#define MAX_HASH_VALUE 18
/* maximum key range = 16, duplicates = 0 */

查看生成的 gperf 代码,它的哈希函数代码执行了从 256 大小的表中查找 len 加 char 值的简单返回。不知何故,我在脑海中想象了一个看起来很奇特的功能...:)

如果我想要 12 个桶(也就是说我不想跳过未使用的桶)怎么办?对于这样的小集合,这确实不重要,但是当我有 1000 个预定义键并且想要连续 1000 个桶时怎么办?

人们能找到一种确定性的方法来做到这一点吗?

Let's say I want to build a perfect hash table for looking up an array where the predefined keys are 12 Months, thus I would want

hash("January")==0
hash("December")==11

I run my Month names through gperf and got a nice hash function, but it appears to give out 16 buckets(or rather the range is 16)!

#define MIN_HASH_VALUE 3
#define MAX_HASH_VALUE 18
/* maximum key range = 16, duplicates = 0 */

Looking at the generated gperf code, its hash function code does a simple return of len plus char value lookup from a 256 size table. Somehow, in my head I imagined a fancy looking function... :)

What if I want exactly 12 buckets(that is I do not want to skip over unused buckets)? For small sets as this, it really doesn't matter, but when I have 1000 predefined keys and want exactly 1000 buckets in a row?

Can one find a deterministic way to do this?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

新人笑 2024-08-19 10:00:20

我对这个问题的答案很感兴趣&通过搜索 gperf 找到它。我尝试了 gperf,但它在大型输入文件上非常慢,因此似乎不合适。我尝试过 cmph 但我对此不满意。它需要构建一个文件,然后在运行时加载到 C 程序中。此外,该程序非常脆弱(任何类型的错误输入都会因“分段错误”而崩溃),以至于我不信任它。经过进一步的 Google 搜索,我找到了此页面,然后转到英里/小时。我下载了mph,发现它非常好。它有一个可选程序来生成 C 文件,称为“emitc”,使用它

 mph < systemdictionaryfile | emitc > output.c

几乎可以立即工作(大约 200,000 个单词的字典只需几秒钟),并创建一个可以正常编译的工作 C 文件。我的测试也表明它有效。不过,我还没有测试哈希算法的性能。

I was interested in the answer to this question & came to it via a search for gperf. I tried gperf, but it was very slow on a large input file and thus did not seem suitable. I tried cmph but I wasn't happy with it. It requires building a file which is then loaded into the C program at run time. Further, the program is so fragile (crashes with "segmentation fault" on any kind of mistaken input) that I did not trust it. A further Google search led me to this page, and onward to mph. I downloaded mph and found it is very nice. It has an optional program to generate a C file, called "emitc", and using it like

 mph < systemdictionaryfile | emitc > output.c

worked almost instantly (a few seconds with a dictionary of about 200,000 words) and created a working C file which compiles with no problems. My tests also indicate that it works. I haven't tested the performance of the hashing algorithm yet though.

紫瑟鸿黎 2024-08-19 10:00:20

我知道 gperf 的唯一替代方案是 cmph : http://cmph.sourceforge.net/ 但是, Jerome 在评论中说,拥有 16 个存储桶可以为您带来一些速度优势。

当我第一次看到最小完美哈希时,我在 上发现了非常有趣的读物CiteseerX 但我抵制住了尝试自己编写其中一个解决方案的诱惑。我知道我最终会得到一个相对于 gperf 或 cmph 而言较差的解决方案,或者即使假设该解决方案具有可比性,我也将不得不花费大量时间。

The only alternative to gperf I know is cmph : http://cmph.sourceforge.net/ but, as Jerome said in the comment, having 16 buckets provides you some speed benefit.

When I first looked at minimal perfect hasihing I found very interesting readings on CiteseerX but I resisted the temptation to try coding one of those solutions myself. I know I would end up with an inferior solution respect to gperf or cmph or, even assuming the solution was comparable, I would have to spend a lot of time on it.

自演自醉 2024-08-19 10:00:20

有很多 MPH 解决方案和算法,gperf 还没有做 MPH,但我正在研究它。特别是。对于大集合。请参阅 https://gitlab.com/rurban/gperf/-/tree/hashfuncs

经典的 cmph 具有大量恒定开销,仅建议用于大型密钥集。修复版本: https://github.com/rurban/cmph

NetBSD nbperf< /strong> 和我的改进变体: https://github.com/rurban/nbperf
它支持 CHM、CHM3 和 BZD,具有整数密钥支持、针对较小密钥集的优化和备用哈希函数。

Bob Jenkin 的 生成器,以及 Taj Khattra 的 mph-1.2

还有两个用于生成 C 查找的 perl 库,一个在 PostgresQL (PerfectHash.pm) 中,一个用于后期 perl5 unicode 查找 (regen/mph.pl),以及一个用于比较各种生成器的工具:https://github.com/rurban/Perfect-Hash

There are many MPH solutions and algorithms, gperf doesn't yet do MPH's, but I'm working on it. Esp. for large sets. See https://gitlab.com/rurban/gperf/-/tree/hashfuncs

The classic cmph has a lot of constant overhead and is only recommended for huge key sets. Fixed version: https://github.com/rurban/cmph

There's the NetBSD nbperf and my improved variant: https://github.com/rurban/nbperf
which does CHM, CHM3 and BZD, with integer key support, optimizations for smaller key sets and alternate hash functions.

There's Bob Jenkin's generator, and Taj Khattra's mph-1.2.

There are also two perl libraries to generate C lookups, one in PostgresQL (PerfectHash.pm) and one for late perl5 unicode lookups (regen/mph.pl), and a tool to compare various generators: https://github.com/rurban/Perfect-Hash

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文