Determining a perfect hash lookup table for Pearson hashing

Posted on 2024-08-04 09:20:36

I'm developing a programming language in which objects are stored as hash tables. The hash function I'm using is Pearson hashing, which depends on a 256-entry (256-byte) lookup table. Here's the function:

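/* Returns the Pearson hash of the NUL-terminated string name.
   lookup must point to a 256-entry table. */
unsigned char pearson(const char *name, const unsigned char *lookup)
{
    unsigned char index = 0;
    while (*name)
    {
        /* Cast to unsigned char so the table index stays in 0..255. */
        index = lookup[index ^ (unsigned char)*name];
        name++;
    }
    return index;
}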

My question is: given a fixed group of fewer than 256 member names, how can one determine a lookup table such that pearson() returns unique values within a contiguous range starting from 0? In other words, I need an algorithm to create a lookup table for a minimal perfect hash. This would let my objects take up no more space than the number of their members. It will be done at compile time, so speed isn't a huge concern, but faster would be better. It would be easy to brute-force this, but I think (hope) there's a better way.

Here's an example: given member variables 'foo', 'bar', and 'baz' in a class, I want to determine a lookup such that:

pearson("foo", lookup) == 0
pearson("bar", lookup) == 1
pearson("baz", lookup) == 2

Note that the order doesn't matter, so the following result would also be acceptable:

pearson("foo", lookup) == 2
pearson("bar", lookup) == 0
pearson("baz", lookup) == 1

In an ideal world, all names that aren't in the table would return a value greater than 2, because that would let me skip a check and possibly even avoid storing the member names. I don't think this is possible, so I'll have to add an extra check to see whether the name is in the table. Given this, it would probably save time not to initialize the lookup-table entries that aren't used. Collisions with outside names don't matter: if a name collides and fails the check, it isn't in the object at all, so the collision doesn't need to be resolved; only the error needs to be handled.
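One alternative to shuffling whole tables at random is a backtracking search that assigns a table entry only when some name actually reads it. The following is a minimal sketch of that idea, not a known efficient algorithm: the names `build_lookup`, `extend`, and `struct search` are made up for illustration, the names are assumed to be distinct, the resulting table is not required to be a permutation, and the worst-case running time is exponential.

```c
#include <assert.h>

#define UNSET (-1)

/* Pearson hash with unsigned arithmetic (same scheme as in the question). */
unsigned char pearson(const char *name, const unsigned char *lookup)
{
    unsigned char index = 0;
    while (*name) {
        index = lookup[index ^ (unsigned char)*name];
        name++;
    }
    return index;
}

/* Search state: table[i] stays UNSET until some name reads entry i;
   taken[h] is nonzero once hash value h has been given to a name. */
struct search {
    int table[256];
    int taken[256];
    const char *const *names;
    int n;
};

/* Try to extend the table so that names[k..n-1] get distinct hashes in
   0..n-1. pos is the next character of names[k], h its running hash. */
static int extend(struct search *s, int k, int pos, int h)
{
    if (k == s->n)
        return 1;                       /* every name has its own slot */
    unsigned char ch = (unsigned char)s->names[k][pos];
    if (ch == '\0') {
        if (h >= s->n || s->taken[h])
            return 0;                   /* out of range, or a collision */
        s->taken[h] = 1;
        if (extend(s, k + 1, 0, 0))
            return 1;
        s->taken[h] = 0;                /* undo and backtrack */
        return 0;
    }
    int slot = h ^ ch;
    if (s->table[slot] != UNSET)        /* fixed by an earlier choice */
        return extend(s, k, pos + 1, s->table[slot]);
    for (int v = 0; v < 256; v++) {     /* try every value for this entry */
        s->table[slot] = v;
        if (extend(s, k, pos + 1, v))
            return 1;
    }
    s->table[slot] = UNSET;
    return 0;
}

/* Fills lookup[256]; returns 1 on success, 0 if no table was found.
   Entries that no name reads are filled with 0 (an arbitrary choice). */
int build_lookup(const char *const *names, int n, unsigned char lookup[256])
{
    assert(n >= 0 && n <= 256);
    struct search s;
    for (int i = 0; i < 256; i++) {
        s.table[i] = UNSET;
        s.taken[i] = 0;
    }
    s.names = names;
    s.n = n;
    if (!extend(&s, 0, 0, 0))
        return 0;
    for (int i = 0; i < 256; i++)
        lookup[i] = (unsigned char)(s.table[i] == UNSET ? 0 : s.table[i]);
    return 1;
}
```

Calling `build_lookup` with the three names "foo", "bar", and "baz" and then hashing each name with `pearson` should give three distinct values below 3. Because untouched entries are only filler, names outside the set can still hash into the same range, so the membership check is still needed.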


Comments (2)

梦里泪两行 2024-08-11 09:20:36

I strongly doubt that you will be able to find a solution by brute force if the number of member names is large. Because of the birthday paradox, the probability that no collision occurs (i.e., that no two hashes are the same) is approximately 1:5000 for 64 member names and 1:850,000,000 for 96. Given the structure of your hash function (it's derived from a cryptographic construction that is designed to "mix" things well), I don't expect that an algorithm exists that solves your problem (but I would definitely be interested in such a beast).
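The odds quoted above can be reproduced with a short calculation. This sketch assumes each name's hash behaves like an independent, uniform draw from 256 values, which is only a model of a randomly chosen table; the function name `no_collision_probability` is made up for illustration.

```c
#include <assert.h>

/* Probability that n names all receive distinct hash values when each
   hash is modelled as an independent, uniform draw from 256 values:
   (256/256) * (255/256) * ... * ((256 - n + 1)/256). */
double no_collision_probability(int n)
{
    assert(n >= 0);
    if (n > 256)
        return 0.0;          /* pigeonhole: a collision is certain */
    double p = 1.0;
    for (int i = 0; i < n; i++)
        p *= (256.0 - i) / 256.0;
    return p;
}
```

Taking the reciprocal of the result for 64 and for 96 names gives odds of the same order as the figures above. Requiring the hashes to land in the contiguous range 0..n-1 as well, as the question does, makes a random table even less likely to succeed.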

Your ideal world is an illusion (as you expected). Assuming the lookup table is a permutation, there are 256 characters you can append to 'foo', and no two of them give a new word with the same hash. Since there are only 256 possible hash values, you can always append a character to 'foo' so that the result hashes to the same value as 'foo', 'bar' or 'baz'.
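The argument can be checked mechanically. The sketch below searches for a single character that, appended to a prefix, produces a chosen hash value; `colliding_suffix` is a hypothetical helper, and the table is assumed to be a permutation of 0..255. One caveat for C strings: the NUL character cannot be appended, so only 255 of the 256 hash values are reachable this way.

```c
#include <assert.h>
#include <stddef.h>

/* Pearson hash with unsigned arithmetic (same scheme as in the question). */
unsigned char pearson(const char *name, const unsigned char *lookup)
{
    unsigned char index = 0;
    while (*name) {
        index = lookup[index ^ (unsigned char)*name];
        name++;
    }
    return index;
}

/* Hypothetical helper: find a non-NUL character c such that hashing
   prefix followed by c yields target. Returns c in 1..255, or -1 when
   only c == 0 would work (which cannot be appended to a C string). */
int colliding_suffix(const char *prefix, const unsigned char *lookup,
                     unsigned char target)
{
    assert(prefix != NULL && lookup != NULL);
    unsigned char h = pearson(prefix, lookup);
    for (int c = 1; c < 256; c++)
        if (lookup[h ^ c] == target)
            return c;
    return -1;
}
```

With a permutation table, `colliding_suffix("foo", lookup, pearson("bar", lookup))` returns a character whose addition to "foo" makes it collide with "bar", unless the only match is the NUL character. This is why a membership check on the stored names cannot be skipped.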

Why don't you use an existing library like CMPH?

没有伤那来痛 2024-08-11 09:20:36

If I understand you correctly, what you need is a sorted, duplicate-free array that you can binary-search. If the key is in the array, its index is the "hash". Otherwise, you get the size of the array. Each lookup is O(log n) compared with O(1) for a lookup table (sorting the array once costs O(n log n)), but that is good enough for a small number of elements, at most 256 in your case.
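A minimal sketch of this approach using the standard library's bsearch is shown here. The names `member_index` and `compare_names` are made up for illustration, and the array is assumed to be sorted in strcmp order already (at compile time, in the asker's setting).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Comparison callback for bsearch: key is a C string, elem points to
   one entry of a sorted array of C strings. */
static int compare_names(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char *const *)elem);
}

/* Returns the index of name in sorted[0..n-1], or n if it is absent.
   The array must be sorted by strcmp order and contain no duplicates. */
size_t member_index(const char *name, const char *const *sorted, size_t n)
{
    assert(name != NULL);
    const char *const *hit =
        bsearch(name, sorted, n, sizeof *sorted, compare_names);
    return hit ? (size_t)(hit - sorted) : n;
}
```

For the sorted array {"bar", "baz", "foo"}, the three members map to slots 0, 1, and 2, and any other name maps to 3, which doubles as the "not a member" signal. Unlike the hashing scheme, this needs the member names stored, but the membership check comes for free.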
