比较两个排序的 int 数组
我有数百万个固定大小 (100) int 数组。每个数组都已排序并具有唯一的元素。对于每个数组,我想找到具有 70% 公共元素的所有数组。现在我每秒进行大约 100 万次比较(使用 Arrays.binarySearch()),这对我们来说太慢了。
谁能推荐一个更好的搜索算法?
I have millions of fixed-size (100) int arrays. Each array is sorted and has unique elements. For each array, I want to find all arrays which have 70% common elements. Right now I am getting around 1 million comparisons (using Arrays.binarySearch()) per second, which is too slow for us.
Can anyone recommend a better searching algorithm?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
您可以尝试忽略重复项的合并排序。对于排序数组来说,这是一个 O(n) 操作。如果两个数组有 70% 的共同元素,则生成的集合将具有 130 个或更少的唯一整数。在您的情况下,您不需要结果,因此您可以只计算唯一条目的数量,并在达到 131 或两个数组的末尾时立即停止。
编辑 (2) 以下代码可以使用 4 个内核在大约 47 秒内完成约 1760 亿次比较。使用 4 个线程将代码设为多线程仅提高了 70%。
仅当 int 值的范围相当小时,使用 BitSet 才有效。否则你必须比较 int[] (如果你需要的话我已经留下了代码)
You could try a merge sort ignoring duplicates. This is an O(n) operation for sorted arrays. If the two arrays have 70% elements in common the resulting collection will have 130 or less unique ints. In your case you don't need the result so you can just count the number of unique entries and stop as soon as you reach 131 or the end of both arrays.
EDIT (2) The following code can do ~176 billion comparisions in about 47 seconds using 4 cores. Making the code multi-threaded with 4 cours was only 70% faster.
Using BitSet only works if the range of int values is fairly small. Otherwise you have to compare the int[] (I have left the code in should you need it)
您可以进行两种快速优化。
如果数组 A 的起始元素大于 B 的结束元素,则它们通常不能有公共元素。
另一个是类似三角不等式的东西:
其原因是(假设
f(A,B) > f(A,C)
)至少有f( A,B) - f(A,C)
元素同时存在于 A 和 B 但不在 C 中。这意味着它们不能是 B 和 C 的公共元素。There are two quick optimisations you can make.
If array A's start element is greater than B's end element, they trivially can't have common elements.
They other one is a triangle inequality-like thing:
The reason for this is that (assuming
f(A,B) > f(A,C)
) there are at leastf(A,B) - f(A,C)
elements that are in both A and B but not in C. Which means that they can't be common elements of B and C.像这样的东西应该可以完成这项工作(假设数组已排序并包含唯一元素):
示例用法:
输出:
Premature Optimizations™
我现在尝试优化上述代码。请检查标记为的代码块是否
有明显差异。优化后,我会重构代码,将其缩减为一两个返回语句。
Something like this should do the job (provided that the arrays are sorted and contain unique elements):
Sample usage:
Output:
Premature Optimizations™
I have now tried to optimize the above code. Please check whether the code blocks marked as
make a noticeable difference. After optimization, I would refactor the code to get it down to one or two return statements.