Determining a perfect-hash lookup table for Pearson hashing

Posted 2024-08-04 09:20:36

I'm developing a programming language, and in my programming language, I'm storing objects as hash tables. The hash function I'm using is Pearson hashing, which depends on a 256-byte lookup table. Here's the function:

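unsigned char pearson(const char *name, const unsigned char *lookup)
{
    /* Running hash; unsigned so it is always a valid index into lookup[0..255]. */
    unsigned char index = 0;
    while (*name)
    {
        /* Cast before the XOR: plain char may be signed and could give a negative index. */
        index = lookup[index ^ (unsigned char)*name];
        name++;
    }
    return index;
}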

My question is: given a fixed set of fewer than 256 member names, how can I determine a lookup table such that pearson() returns unique values within a contiguous range starting from '\0'? In other words, I need an algorithm that creates a lookup table for a perfect hash. This would let my objects take up no more space than the number of their members. It will be done at compile time, so speed isn't a huge concern, but faster would be better. Brute-forcing this would be easy, but I think (hope) there's a better way.

Here's an example: given member variables 'foo', 'bar', and 'baz' in a class, I want to determine a lookup such that:

pearson("foo", lookup) == 0
pearson("bar", lookup) == 1
pearson("baz", lookup) == 2

Note that the order doesn't matter, so the following result would also be acceptable:

pearson("foo", lookup) == 2
pearson("bar", lookup) == 0
pearson("baz", lookup) == 1
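
For illustration, here is one way such a table could be searched for: a minimal backtracking sketch, not a known-efficient algorithm. It fills in table entries only as the names need them, forces the entry reached by each name's last character into 0..count-1, and undoes a choice when it leads to a dead end. The names build_table, search, and the g_* variables are hypothetical, the member names are assumed to be non-empty and distinct, and the worst-case running time is exponential.

```c
#include <assert.h>
#include <string.h>

/* Pearson hash with unsigned arithmetic so every index stays in 0..255. */
static unsigned char pearson(const char *name, const unsigned char *lookup)
{
    unsigned char index = 0;
    for (; *name; name++)
        index = lookup[index ^ (unsigned char)*name];
    return index;
}

/* Search state (hypothetical names, for illustration only). */
static const char **g_names;   /* member names, assumed non-empty and distinct */
static int g_count;            /* number of names, at most 256 */
static int g_table[256];       /* -1 means "entry not chosen yet" */
static int g_value_used[256];  /* keeps chosen entries distinct */
static int g_slot_taken[256];  /* hash values already claimed by a name */

/* Try to hash names[k] from character position pos with running hash h. */
static int search(int k, int pos, int h)
{
    if (k == g_count)
        return 1;                          /* every name has its own slot */
    const char *name = g_names[k];
    if (name[pos] == '\0') {               /* name consumed: h is its hash */
        if (h >= g_count || g_slot_taken[h])
            return 0;
        g_slot_taken[h] = 1;
        if (search(k + 1, 0, 0))
            return 1;
        g_slot_taken[h] = 0;
        return 0;
    }
    int cell = h ^ (unsigned char)name[pos];
    if (g_table[cell] >= 0)                /* entry fixed by an earlier choice */
        return search(k, pos + 1, g_table[cell]);
    int last = (name[pos + 1] == '\0');
    for (int i = 0; i < 256; i++) {
        /* Final characters must land in 0..count-1; earlier ones prefer high
           values so the small values stay free for the final characters. */
        int v = last ? i : 255 - i;
        if (last && v >= g_count)
            break;
        if (g_value_used[v])
            continue;
        g_table[cell] = v;
        g_value_used[v] = 1;
        if (search(k, pos + 1, v))
            return 1;
        g_value_used[v] = 0;               /* undo and try the next value */
        g_table[cell] = -1;
    }
    return 0;
}

/* Fills lookup[256] and returns 1 on success, 0 if no table was found. */
int build_table(const char **names, int count, unsigned char *lookup)
{
    assert(count >= 0 && count <= 256);
    g_names = names;
    g_count = count;
    for (int i = 0; i < 256; i++)
        g_table[i] = -1;
    memset(g_value_used, 0, sizeof g_value_used);
    memset(g_slot_taken, 0, sizeof g_slot_taken);
    if (!search(0, 0, 0))
        return 0;
    /* Give untouched entries the leftover values to complete a permutation. */
    int next = 0;
    for (int i = 0; i < 256; i++) {
        if (g_table[i] < 0) {
            while (g_value_used[next])
                next++;
            g_table[i] = next;
            g_value_used[next] = 1;
        }
        lookup[i] = (unsigned char)g_table[i];
    }
    return 1;
}
```

Calling build_table with the three names above and a 256-byte buffer should yield a table under which pearson() maps 'foo', 'bar', and 'baz' to distinct values in 0..2. Preferring high values for intermediate entries is only a heuristic to keep the small values available; it does not bound the search time.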

In an ideal world, all names that aren't in the table would return a value greater than 2 because this would allow me to avoid a check and possibly even avoid storing the member names, but I don't think this is possible, so I'll have to add an extra check to see if it's in the table. Given this, it probably would save time to not initialize values in the lookup table which aren't used (collisions don't matter, because if it collides and fails the check, it isn't in the object at all, so the collision doesn't need to be resolved; only the error needs to be handled).
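
The extra check described above might look like the following sketch. It assumes the member names are stored in an array indexed by hash value; find_member, slot_names, and SLOT_COUNT are hypothetical names, and the table is a hand-built example for the three names in which every entry the members do not touch is simply left at 0.

```c
#include <string.h>

/* Pearson hash with unsigned arithmetic so every index stays in 0..255. */
static unsigned char pearson(const char *name, const unsigned char *lookup)
{
    unsigned char index = 0;
    for (; *name; name++)
        index = lookup[index ^ (unsigned char)*name];
    return index;
}

/* Hand-built example table; entries not listed here default to 0. */
static const unsigned char lookup[256] = {
    [102] = 255, [144] = 254, [145] = 0,  /* 'f','o','o' -> 0 */
    [98]  = 253, [156] = 252, [142] = 1,  /* 'b','a','r' -> 1 */
    [134] = 2                             /* "baz" reuses the "ba" steps -> 2 */
};

/* Member names stored at the slot their hash selects. */
static const char *const slot_names[] = { "foo", "bar", "baz" };
enum { SLOT_COUNT = 3 };

/* Returns the member's slot index, or -1 if name is not a member. */
int find_member(const char *name)
{
    unsigned char h = pearson(name, lookup);
    if (h >= SLOT_COUNT)
        return -1;                        /* hash outside the object's slots */
    /* A non-member can still hash into a valid slot, so compare the names. */
    return strcmp(slot_names[h], name) == 0 ? h : -1;
}
```

With this table, a non-member such as "qux" only ever reads unset entries, hashes to 0, and is rejected by the string comparison rather than being mistaken for 'foo'.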

梦里泪两行 2024-08-11 09:20:36

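I strongly doubt that you will be able to find a solution by brute force if the number of member names is too high. Because of the birthday paradox, the probability that no collision occurs (i.e., that no two hashes are the same) is approximately 1:5000 for 64 member names and 1:850,000,000 for 96. Given the structure of your hash function (it is derived from a cryptographic construction designed to "mix" things well), I don't expect that an algorithm exists that solves your problem (but I would definitely be interested in such a beast).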
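
The birthday-paradox figures can be reproduced with a short calculation. The sketch below assumes the hash values behave like independent, uniformly random draws from 256 possibilities, which is an idealization of a randomly chosen table; no_collision_probability is a hypothetical helper name.

```c
/* Probability that n values drawn uniformly and independently from
   `buckets` possibilities are all distinct:
   the product of (1 - i/buckets) for i = 0 .. n-1. */
double no_collision_probability(int n, int buckets)
{
    if (n > buckets)
        return 0.0;                 /* pigeonhole: a collision is certain */
    double p = 1.0;
    for (int i = 0; i < n; i++)
        p *= 1.0 - (double)i / buckets;
    return p;
}
```

Taking 1.0 / no_collision_probability(64, 256) and 1.0 / no_collision_probability(96, 256) should give odds of the same order as the figures quoted above. Note that this only counts distinctness; requiring the values to fall in a contiguous range starting at 0 makes a random table even less likely to work.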

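Your ideal world is an illusion (as you expected): there are 256 characters you can append to 'foo', and no two of them give a new word with the same hash (this holds when the lookup table is a permutation of 0..255, as Pearson hashing normally uses). Since there are only 256 possible hash values, you can therefore append a character to 'foo' so that its hash equals any of the hashes of 'foo', 'bar', or 'baz'.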
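
The appended-character argument can be checked with a small sketch. It assumes the table is a permutation of 0..255; the example permutation and the function names are hypothetical. Because a NUL byte cannot be part of a C string, only 255 of the 256 extensions are representable, so the count below is 255 rather than 256, which still covers every hash value except one.

```c
/* Pearson hash with unsigned arithmetic so every index stays in 0..255. */
static unsigned char pearson(const char *name, const unsigned char *lookup)
{
    unsigned char index = 0;
    for (; *name; name++)
        index = lookup[index ^ (unsigned char)*name];
    return index;
}

/* Example permutation of 0..255: an odd multiplier is invertible mod 256. */
void fill_example_permutation(unsigned char *lookup)
{
    for (int i = 0; i < 256; i++)
        lookup[i] = (unsigned char)(i * 167 + 13);
}

/* Counts the distinct hashes taken by base followed by one extra character.
   c starts at 1 because a NUL byte would terminate the string. */
int count_extension_hashes(const char *base, const unsigned char *lookup)
{
    int seen[256] = {0};
    int distinct = 0;
    unsigned char h = pearson(base, lookup);
    for (int c = 1; c < 256; c++) {
        unsigned char extended = lookup[h ^ c];  /* hash of base + c */
        if (!seen[extended]) {
            seen[extended] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

For any permutation table, count_extension_hashes("foo", table) comes out as 255, so almost every slot value, including those assigned to real members, is the hash of some non-member name.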

Why don't you use an existing library such as CMPH?
