Perfect hash function

Posted 2024-10-01 11:00:10

I'm attempting to hash the values

10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0

I need a function that will map them to an array that has a size of 13 without causing any collisions.

I've spent several hours thinking this over and googling and can't figure this out. I haven't come close to a viable solution.

How would I go about finding a hash function of this sort? I've played with gperf, but I don't really understand it and I couldn't get the results I was looking for.

Comments (7)

浪荡不羁 2024-10-08 11:00:10

If you know the exact keys, it is trivial to produce a perfect hash function:

int hash (int n) {
  switch (n) {
    case 10:   return 0;
    case 100:  return 1;
    case 32:   return 2;
    // ...
    default:   return -1;
  }
}
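For comparison, here is a minimal Python sketch of the same explicit mapping. It assumes the elided cases simply continue in list order, which the answer does not state; the names KEYS, SLOT and perfect_hash are made up for illustration. Python's dict is itself a hash table, so this only illustrates the mapping and is not a faster replacement for the switch.

```python
KEYS = [10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0]

# Assumption: every key gets its position in KEYS as its slot, which matches
# the three cases shown in the switch (10 -> 0, 100 -> 1, 32 -> 2).
SLOT = {key: index for index, key in enumerate(KEYS)}


def perfect_hash(n):
    """Return the slot of a known key, or -1 for any other value."""
    return SLOT.get(n, -1)
```

With 11 keys this fills slots 0..10 and leaves slots 11 and 12 of a 13-entry array unused.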

心凉怎暖 2024-10-08 11:00:10

Found One

I tried a few things and found one semi-manually:

(n ^ 28) % 13

The semi-manual part was the following ruby script that I used to test candidate functions with a range of parameters:

t = [10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0]
(1..200).each do |i|
  t2 = t.map { |e| (e ^ i) % 13 }
  puts i if t2.uniq.length == t.length
end
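The same check can be sketched in Python, where ^ is also bitwise XOR. The helper names (xor_mod_hash, is_perfect) are invented for this example and are not part of the answer.

```python
KEYS = [10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0]
TABLE_SIZE = 13


def xor_mod_hash(n, mask=28):
    """Candidate hash: XOR the key with a constant, then reduce modulo 13."""
    return (n ^ mask) % TABLE_SIZE


def is_perfect(mask):
    """True if no two keys share a slot for this XOR constant."""
    return len({xor_mod_hash(k, mask) for k in KEYS}) == len(KEYS)


# Same parameter range as the Ruby script above.
working_masks = [mask for mask in range(1, 201) if is_perfect(mask)]
slots = {k: xor_mod_hash(k) for k in KEYS}
```

Here slots maps each key to its index for the constant 28, and working_masks lists every constant in 1..200 that is collision-free.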

决绝 2024-10-08 11:00:10

On some platforms (e.g. embedded), the modulo operation is expensive, so % 13 is better avoided. A bitwise AND that keeps the low-order bits is cheap, and for non-negative values it is equivalent to modulo by a power of 2.

I tried writing a simple program (in Python) to search for a perfect hash of your 11 data points, using simple forms such as ((x << a) ^ (x << b)) & 0xF (where & 0xF is equivalent to % 16, giving a result in the range 0..15). I was able to find the following collision-free hash, which gives an index in the range 0..15, so the table needs 16 entries rather than 13 (expressed as a C macro):

#define HASH(x)    ((((x) << 2) ^ ((x) >> 2)) & 0xF)

Here is the Python program I used:

data = [10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0]

def shift_right(value, shift_value):
    """Shift right; a negative shift_value shifts left instead
    (Python's shift operators reject negative shift counts).
    A shift_value of None contributes 0, i.e. the term is unused."""
    if shift_value is None:
        return 0
    if shift_value < 0:
        return value << (-shift_value)
    return value >> shift_value

def find_hash():
    def hashf(val, i, j=None, k=None):
        return (shift_right(val, i) ^ shift_right(val, j) ^ shift_right(val, k)) & 0xF

    for i in range(-7, 8):
        for j in range(i, 8):
            k = None  # third shift term disabled; loop over k as well to enable it
            outputs = set()
            for val in data:
                hash_val = hashf(val, i, j, k)
                # Values 13..15 are accepted here, so the table needs 16 slots.
                if hash_val in outputs:
                    break  # collision: reject this (i, j) pair
                outputs.add(hash_val)
            else:
                # The loop finished without a break: all values are in distinct slots.
                print(i, j, k, outputs)

if __name__ == '__main__':
    find_hash()
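To double-check the macro itself, here is a short Python transcription. shift_xor_hash is a made-up name for this sketch, and the keys are assumed to be non-negative so that Python's shifts behave like C's shifts on unsigned values.

```python
KEYS = [10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0]


def shift_xor_hash(x):
    """Python transcription of HASH(x): ((x << 2) ^ (x >> 2)) & 0xF."""
    return ((x << 2) ^ (x >> 2)) & 0xF


slots = {k: shift_xor_hash(k) for k in KEYS}
is_collision_free = len(set(slots.values())) == len(KEYS)
largest_slot = max(slots.values())
```

The largest index produced tells you how big the table really has to be. For these keys it goes above 12, which is why this hash fits a 16-entry table but not a 13-entry one.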

过期情话 2024-10-08 11:00:10

Bob Jenkins has a program for this too: http://burtleburtle.net/bob/hash/perfect.html

Unless you're very lucky, there's no "nice" perfect hash function for a given dataset. Perfect hashing algorithms usually use a simple hashing function on the keys (using enough bits so it's collision-free) then use a table to finish it off.
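To make the "simple hash plus table" idea concrete, here is a rough sketch of one such scheme, often called hash-and-displace. It is not Bob Jenkins's algorithm. The bucket count, the two helper hashes and all names are assumptions picked by hand for this key set.

```python
KEYS = [10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0]
TABLE_SIZE = 13
NUM_BUCKETS = 4  # assumption: small power of two chosen by hand


def bucket_of(key):
    """First-level hash: the low two bits pick a bucket."""
    return key % NUM_BUCKETS


def slot_of(key, displacement):
    """Second-level hash: remaining bits plus the bucket's displacement."""
    return (key // NUM_BUCKETS + displacement) % TABLE_SIZE


def build_displacements(keys):
    """Find one displacement per bucket so that all keys get distinct slots."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key in keys:
        buckets[bucket_of(key)].append(key)

    displacements = [0] * NUM_BUCKETS
    used = set()
    # Place the fullest buckets first; they are the hardest to fit.
    for b in sorted(range(NUM_BUCKETS), key=lambda b: -len(buckets[b])):
        for d in range(TABLE_SIZE):
            slots = {slot_of(key, d) for key in buckets[b]}
            if len(slots) == len(buckets[b]) and not (slots & used):
                displacements[b] = d
                used |= slots
                break
        else:
            raise ValueError("no displacement fits; try another NUM_BUCKETS")
    return displacements


DISPLACEMENTS = build_displacements(KEYS)


def perfect_hash(key):
    """Look up the bucket's displacement, then compute the final slot."""
    return slot_of(key, DISPLACEMENTS[bucket_of(key)])
```

The lookup costs one small table read plus a little arithmetic. The search happens once, offline, and the resulting DISPLACEMENTS table can be hard-coded.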

可是我不能没有你 2024-10-08 11:00:10

Just some quasi-analytical ramblings:

In your set of numbers, eleven in all, three are odd and eight are even.
The simplest form of hashing, % 13, gives the following hash values:
10 - 10,
100 - 9,
32 - 6,
45 - 6,
58 - 6,
126 - 9,
3 - 3,
29 - 3,
200 - 5,
400 - 10,
0 - 0

Which, of course, is unusable due to the number of collisions. Something more elaborate is needed.
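The collisions are easy to list programmatically. The following sketch groups the keys by their residue modulo 13; the function name collisions is hypothetical.

```python
from collections import defaultdict

KEYS = [10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0]


def collisions(keys, table_size=13):
    """Group keys by key % table_size and keep only the slots hit more than once."""
    by_slot = defaultdict(list)
    for key in keys:
        by_slot[key % table_size].append(key)
    return {slot: hits for slot, hits in by_slot.items() if len(hits) > 1}


COLLISIONS = collisions(KEYS)
```

Only 200 and 0 end up alone in their slots; every other key shares a slot with at least one other key.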

Why state the obvious?
Considering that the numbers are so few, any elaborate (or rather, "less simple") algorithm will likely be slower than either the switch statement or (which I prefer) simply searching through an unsigned short/long vector of eleven positions and using the index of the match.

Why use a vector search?

  1. You can fine-tune it by placing the most frequently occurring values towards the beginning of the vector.
  2. I assume the purpose is to plug in the hash index into a switch with nice, sequential numbering. In that light it seems wasteful to first use a switch to find the index and then plug it into another switch. Maybe you should consider not using hashing at all and go directly to the final switch?
  3. The switch version of hashing cannot be fine-tuned and, due to the widely differing values, will cause the compiler to generate a binary search tree which will result in a lot of comparisons and conditional/other jumps (especially costly) which take time (I've assumed you've turned to hashing for its speed) and require space.
  4. If you want to speed up the vector search additionally and are using an x86-system you can implement a vector search based on the assembler instructions repne scasw (short)/repne scasd (long) which will be much faster. After a setup time of a few instructions you will find the first entry in one instruction and the last in eleven followed by a few instructions cleanup. This means 5-10 instructions best case and 15-20 worst. This should beat the switch-based hashing in all but maybe one or two cases.
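As a plain illustration of the vector search (without the x86 string instructions), here is a minimal Python sketch. The ordering of KEYS is an assumption: in practice you would put the values you expect most often first. find_index is a made-up name.

```python
# Assumption: ordered so that the most frequently looked-up values come first.
KEYS = [10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0]


def find_index(value, keys=KEYS):
    """Linear scan: return the position of value in keys, or -1 if absent."""
    for index, key in enumerate(keys):
        if key == value:
            return index
    return -1
```

The returned index is already a dense number in 0..10, so it can be used directly as the array index or switch label that the hash was meant to produce.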

墟烟 2024-10-08 11:00:10

I did a quick check: applying the SHA256 hash function and then taking the result modulo 13 worked when I tried it in Mathematica. For C++, this function should be available in the OpenSSL library. See this post.

If you are doing a lot of hashing and lookups, though, modular division is a pretty expensive operation to do repeatedly. There is another way of mapping an n-bit hash value to i-bit indices. See this post by Michael Mitzenmacher about how to do it with a bit shift operation in C. Hope that helps.
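A rough Python sketch of both steps follows. The answer does not say how the integers were fed to SHA256 in Mathematica, so the decimal-string encoding below is an assumption and may not reproduce the collision-free result; the code only measures it. Taking the top bits with a shift is likewise just one plausible reading of the bit-shift approach, and all function names are invented here.

```python
import hashlib

KEYS = [10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0]
DIGEST_BITS = 256


def sha256_int(n):
    """SHA256 of the key's decimal string (assumed encoding), as a 256-bit integer."""
    digest = hashlib.sha256(str(n).encode("ascii")).digest()
    return int.from_bytes(digest, "big")


def mod_index(n, table_size=13):
    """Reduce the digest with a modulo, as in the first paragraph."""
    return sha256_int(n) % table_size


def shift_index(n, index_bits=4):
    """Keep only the top index_bits bits of the digest: a shift, no division."""
    return sha256_int(n) >> (DIGEST_BITS - index_bits)


mod_slots = [mod_index(k) for k in KEYS]
shift_slots = [shift_index(k) for k in KEYS]
mod_is_perfect = len(set(mod_slots)) == len(KEYS)
```

Check mod_is_perfect before relying on this. With a different encoding of the keys the slots change completely.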

青衫负雪 2024-10-08 11:00:10

Try the following, which maps each of your values n to a unique index between 0 and 12:
(1369%(n+1))%13
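The answer does not say how 1369 was found. One guess is a brute-force search over the constant, sketched below with made-up names and an arbitrary search limit of 2000.

```python
KEYS = [10, 100, 32, 45, 58, 126, 3, 29, 200, 400, 0]


def nested_mod_hash(n, c=1369):
    """(c % (n + 1)) % 13; the + 1 avoids a division by zero for n = 0."""
    return (c % (n + 1)) % 13


def is_perfect(c):
    """True if the constant c sends all keys to distinct slots."""
    return len({nested_mod_hash(k, c) for k in KEYS}) == len(KEYS)


# Assumption: the search range is arbitrary; the answer gives no range.
working_constants = [c for c in range(1, 2001) if is_perfect(c)]
slots = {k: nested_mod_hash(k) for k in KEYS}
```

Note that this form needs two modulo operations per lookup, one of them by a variable divisor, so it is compact but not necessarily fast.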
