Good hash function for permutations?

Posted on 2024-08-07 06:25:34

I have numbers in a specific range (usually from 0 to about 1000). An algorithm selects a few numbers from this range (about 3 to 10). This selection is done very often, and I need to check whether a permutation of the chosen numbers has already been selected.

E.g., one step selects [1, 10, 3, 18] and another selects [10, 18, 3, 1]; the second selection can then be discarded because it is a permutation of the first.

I need to do this check very fast. Right now I put all arrays in a hashmap and use a custom hash function that just sums up all the elements, so 1+10+3+18=32, and also 10+18+3+1=32. For equals I use a bitset to quickly check whether the elements are in both sets (I do not need sorting when using the bitset, but it only works when the range of numbers is known and not too big).
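For reference, the approach described above might look roughly like the following Java sketch. The class name `SelectionKey` is hypothetical, and the sketch assumes that the values inside one selection are distinct; with repeated values a bitset cannot tell [1, 2, 2] from [1, 1, 2].

```java
import java.util.BitSet;

// Hypothetical wrapper so that int[] selections can be used as hash keys.
// Assumption: the values inside one selection are distinct and non-negative.
final class SelectionKey {
    private final int[] values;
    private final BitSet bits = new BitSet();

    SelectionKey(int[] values) {
        this.values = values.clone();
        for (int v : this.values) {
            bits.set(v);
        }
    }

    // Order-independent but weak: many different selections share the same sum.
    @Override
    public int hashCode() {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Same length and same set of values: one is a permutation of the other
    // (only valid under the distinct-values assumption).
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof SelectionKey)) {
            return false;
        }
        SelectionKey other = (SelectionKey) o;
        return values.length == other.values.length && bits.equals(other.bits);
    }
}
```

Adding such keys to a `java.util.HashSet` makes `add` return `false` when a permutation of the selection is already present.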

This works OK, but it can generate lots of collisions, so the equals() method is called quite often. Is there a faster way to check for permutations?

Are there any good hash functions for permutations?

UPDATE

I have done a little benchmark: generate all combinations of numbers in the range 0 to 6, with array lengths 1 to 9. There are 3003 possible selections (ignoring order), and a good hash should generate close to this many different hashes (I use 32-bit numbers for the hash):

  • 41 different hashes for just adding (so there are lots of collisions)
  • 8 different hashes for XOR'ing values together
  • 286 different hashes for multiplying
  • 3003 different hashes for (R + 2e) and multiplying as abc has suggested (using 1779033703 for R)

So abc's hash can be calculated very fast and is a lot better than all the rest. Thanks!
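A benchmark of this kind could be sketched as follows. The exact enumeration used for the numbers above is not fully specified, so this sketch assumes fixed-length multisets (enumerated as non-decreasing arrays) and will not necessarily reproduce those counts. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.ToIntFunction;

final class HashBenchmark {
    static final int R = 1779033703;

    static int sumHash(int[] a) { int h = 0; for (int e : a) h += e; return h; }

    static int xorHash(int[] a) { int h = 0; for (int e : a) h ^= e; return h; }

    static int productHash(int[] a) {
        int p = 1;
        for (int e : a) p *= R + 2 * e;   // int multiplication wraps modulo 2^32
        return p >>> 1;
    }

    // Counts distinct hash values over all non-decreasing arrays of the given
    // length with entries in 0..maxValue (one representative per multiset).
    static int countDistinct(int length, int maxValue, ToIntFunction<int[]> hash) {
        Set<Integer> seen = new HashSet<>();
        enumerate(new int[length], 0, 0, maxValue, hash, seen);
        return seen.size();
    }

    private static void enumerate(int[] cur, int pos, int min, int maxValue,
                                  ToIntFunction<int[]> hash, Set<Integer> seen) {
        if (pos == cur.length) {
            seen.add(hash.applyAsInt(cur));
            return;
        }
        for (int v = min; v <= maxValue; v++) {
            cur[pos] = v;
            enumerate(cur, pos + 1, v, maxValue, hash, seen);
        }
    }

    public static void main(String[] args) {
        int length = 4, maxValue = 6;   // assumed parameters, chosen for illustration
        System.out.println("sum:     " + countDistinct(length, maxValue, HashBenchmark::sumHash));
        System.out.println("xor:     " + countDistinct(length, maxValue, HashBenchmark::xorHash));
        System.out.println("product: " + countDistinct(length, maxValue, HashBenchmark::productHash));
    }
}
```

The closer a count is to the number of enumerated multisets, the fewer collisions that hash produces on this input.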

PS: I do not want to sort the values when I do not have to, because this would get too slow.

Comments (7)

暮凉 2024-08-14 06:25:34

One potential candidate might be this.
Fix an odd integer R.
For each element e you want to hash, compute the factor (R + 2*e).
Then compute the product of all these factors.
Finally, divide the product by 2 to get the hash.

Because R is odd, the factor 2 in (R + 2e) guarantees that all factors are odd, so
the product can never become 0. The division by 2 at the end is there because
the product is always odd, so the division just removes a constant bit.

E.g. I choose R = 1779033703. This is an arbitrary choice; some experiments should show whether a given R is good or bad. Assume your values are [1, 10, 3, 18].
The product (computed using 32-bit ints, i.e. modulo 2^32) is

(R + 2) * (R + 20) * (R + 6) * (R + 36) = 108059677

Hence the hash would be

108059677/2 = 54029838.
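A minimal Java sketch of this scheme, assuming 32-bit `int` arithmetic with wrap-around and an unsigned shift in place of the division by 2 (the class and method names are hypothetical):

```java
final class PermutationHash {
    // Odd constant taken from the answer; other odd values may work as well.
    static final int R = 1779033703;

    static int hash(int[] values) {
        int product = 1;
        for (int e : values) {
            // Each factor is odd, and int multiplication wraps modulo 2^32,
            // so the running product stays odd and never becomes 0.
            product *= R + 2 * e;
        }
        // Unsigned shift: treats the product as an unsigned 32-bit value
        // and drops its lowest bit, which is always 1.
        return product >>> 1;
    }

    public static void main(String[] args) {
        System.out.println(hash(new int[]{1, 10, 3, 18}));
        System.out.println(hash(new int[]{10, 18, 3, 1}));   // same value as the line above
    }
}
```

Because multiplication modulo 2^32 is commutative, the result does not depend on the order of the elements, and no sorting is needed.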

明天过后 2024-08-14 06:25:34

Summing the elements is already one of the simplest things you could do. But I don't think it's a particularly good hash function w.r.t. pseudo randomness.

If you sort your arrays before storing them or computing hashes, every good hash function will do.

If it's about speed: have you measured where the bottleneck is? If your hash function gives you a lot of collisions and you spend most of the time comparing the arrays bit by bit, the hash function is obviously not good at what it's supposed to do. Sorting plus a better hash might be the solution.
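The sort-then-hash idea can be sketched in Java like this. It uses the standard `Arrays.hashCode(int[])` as the "good hash function", which is an assumption on my part rather than something the answer prescribes; the class name is hypothetical.

```java
import java.util.Arrays;

final class SortedHash {
    // Sorts a copy so that the caller's array keeps its order,
    // then hashes the canonical (sorted) form.
    static int hash(int[] values) {
        int[] sorted = values.clone();
        Arrays.sort(sorted);
        return Arrays.hashCode(sorted);
    }

    // Two arrays are permutations of each other exactly when
    // their sorted copies are equal element by element.
    static boolean isPermutation(int[] a, int[] b) {
        int[] x = a.clone();
        int[] y = b.clone();
        Arrays.sort(x);
        Arrays.sort(y);
        return Arrays.equals(x, y);
    }
}
```

For arrays of 3 to 10 elements the sort is cheap in absolute terms, but it is still extra work per selection, so whether it pays off is something to measure.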

羞稚 2024-08-14 06:25:34

If I understand your question correctly, you want to test equality between sets whose items are not ordered. This is precisely what a Bloom filter will do for you. At the expense of a small number of false positives (in which case you'll need to fall back to a brute-force set comparison), you'll be able to compare such sets by checking whether their Bloom filter hashes are equal.

The algebraic reason why this holds is that the OR operation is commutative. This holds for other semirings, too.
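As a rough illustration, here is a heavily simplified Java sketch: a single 64-bit signature with one hash function per element, not a full Bloom filter with several hash functions. The mixing constant and all names are my own assumptions.

```java
final class BloomSignature {
    // Builds a 64-bit signature by OR-ing one bit per element.
    // OR is commutative, so the order of the elements does not matter.
    static long signature(int[] values) {
        long sig = 0L;
        for (int e : values) {
            // Multiplicative mix; the top 6 bits select a bit index in 0..63.
            int bit = (e * 0x9E3779B9) >>> 26;
            sig |= 1L << bit;
        }
        return sig;
    }

    // Different signatures prove that the arrays are not permutations of each
    // other; equal signatures still have to be confirmed by an exact comparison.
    static boolean mayBePermutation(int[] a, int[] b) {
        return a.length == b.length && signature(a) == signature(b);
    }
}
```

Note that OR is also idempotent, so repeated values leave no trace in the signature; that is one more reason an exact check is still needed after a signature match.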

夜声 2024-08-14 06:25:34

Depending on whether you have a lot of collisions (the same hash, but not a permutation), you might presort the arrays before hashing them. In that case you can use a more aggressive kind of hashing, where you don't just add up the numbers but also apply some bit magic to get quite different hashes.

This is only beneficial if you get loads of unwanted collisions because the hash you are using now is too poor. If you hardly get any collisions, the method you are using seems fine.

思念绕指尖 2024-08-14 06:25:34

I would suggest this:

1. Check whether the lengths of the two arrays are the same (if not, they are not equal).

2. Sort only one array. Instead of sorting the other array, iterate through the elements of the 1st array and search for each of them in the sorted 2nd array (compare only while the elements in the 2nd array are smaller; do not iterate through the whole array).

Note: if the same number can occur more than once in your permutations (e.g. [1,2,2,10]), you will need to remove an element from the 2nd array once it has been matched with a member of the 1st one.

pseudo-code:

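if length(arr1) <> length(arr2) return false;
sort(arr2);
for i = 1 to length(arr1) {
    elem = arr1[i];
    j = 1;
    // skip the elements of the sorted array that are smaller than elem
    while (j <= length(arr2) and arr2[j] < elem) j = j + 1;
    // elem is missing if we ran off the end or stopped at a larger element
    if (j > length(arr2) or arr2[j] <> elem) return false;
}
return true;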

The idea is that instead of sorting the other array as well, we just try to match each of its elements in the sorted array.
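A Java sketch of this idea that also covers repeated values might look like the following. Instead of physically removing matched elements, it marks them as used, which is my own simplification; the class and method names are hypothetical.

```java
import java.util.Arrays;

final class OneSidedMatch {
    // Sorts a copy of b only, then matches every element of a against it.
    static boolean isPermutation(int[] a, int[] b) {
        if (a.length != b.length) {
            return false;
        }
        int[] sorted = b.clone();
        Arrays.sort(sorted);
        boolean[] used = new boolean[sorted.length];   // marks already matched entries
        for (int elem : a) {
            int j = 0;
            // Skip smaller values and equal values that were matched earlier.
            while (j < sorted.length
                    && (sorted[j] < elem || (sorted[j] == elem && used[j]))) {
                j++;
            }
            if (j == sorted.length || sorted[j] != elem) {
                return false;   // ran off the end or stopped at a larger value
            }
            used[j] = true;
        }
        return true;
    }
}
```

The scan is linear per element, which is acceptable for arrays of about 3 to 10 entries; for longer arrays a binary search or sorting both arrays would likely be preferable.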

遥远的她 2024-08-14 06:25:34

You can probably reduce the collisions a lot by using the product as well as the sum of the terms.

1*10*3*18=540 and 10*18*3*1=540

So the sum-product hash would be [32, 540].

You still need to handle collisions when they do happen, though.
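A small Java sketch of the sum-product pair follows. Folding the pair into a single `int` with the factor 31 is my own assumption, as are the names. Note that the product becomes 0 as soon as any element is 0, which matters here because the stated range starts at 0.

```java
final class SumProductHash {
    // Returns {sum, product}; both are independent of the element order.
    static long[] sumAndProduct(int[] values) {
        long sum = 0;
        long product = 1;
        for (int e : values) {
            sum += e;
            product *= e;   // wraps modulo 2^64 on overflow; 0 if any element is 0
        }
        return new long[]{sum, product};
    }

    // Folds the pair into one int so it can serve as a hashCode().
    static int hash(int[] values) {
        long[] sp = sumAndProduct(values);
        return 31 * Long.hashCode(sp[0]) + Long.hashCode(sp[1]);
    }
}
```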

樱&纷飞 2024-08-14 06:25:34

I like using the string's default hash code (in Java and C#; not sure about other languages); it generates pretty unique hash codes.
So first sort the array, and then generate a unique string from it using some delimiter.

So you can do the following (Java):

    int[] arr = selectRandomNumbers();
    Arrays.sort(arr);
    int hash = (arr[0] + "," + arr[1] + "," + arr[2] + "," + arr[3]).hashCode();

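If performance is an issue, you can replace the inefficient string concatenation with a StringBuilder. String.format also works, but in Java it takes %d placeholders rather than {0}-style ones, and it is typically slower than a StringBuilder:

    String.format("%d,%d,%d,%d", arr[0], arr[1], arr[2], arr[3]);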

The string hash code of course doesn't guarantee that two distinct strings have different hashes, but given the suggested formatting, collisions should be extremely rare.
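Since the selections in the question have between 3 and 10 elements, a variant that does not hard-code four elements may be more practical. The following sketch uses `Arrays.toString`, which is my substitution for the comma-joined string; the class name is hypothetical.

```java
import java.util.Arrays;

final class SortedStringHash {
    // Sorts a copy, renders it as a string such as "[1, 3, 10, 18]",
    // and uses that string's hash code.
    static int hash(int[] values) {
        int[] sorted = values.clone();
        Arrays.sort(sorted);
        return Arrays.toString(sorted).hashCode();
    }
}
```

This still builds a temporary string per call, so it is likely slower than hashing the sorted `int[]` directly.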
