比较两个排序的 int 数组

发布于 2024-10-13 22:29:30 字数 154 浏览 6 评论 0原文

我有数百万个固定大小 (100) int 数组。每个数组都已排序并具有唯一的元素。对于每个数组,我想找到具有 70% 公共元素的所有数组。现在我每秒进行大约 100 万次比较(使用 Arrays.binarySearch()),这对我们来说太慢了。

谁能推荐一个更好的搜索算法?

I have millions of fixed-size (100) int arrays. Each array is sorted and has unique elements. For each array, I want to find all arrays which have 70% common elements. Right now I am getting around 1 million comparisons (using Arrays.binarySearch()) per second, which is too slow for us.

Can anyone recommend a better searching algorithm?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

月依秋水 2024-10-20 22:29:30

您可以尝试忽略重复项的合并排序。对于排序数组来说,这是一个 O(n) 操作。如果两个数组有 70% 的共同元素,则生成的集合将具有 130 个或更少的唯一整数。在您的情况下,您不需要结果,因此您可以只计算唯一条目的数量,并在达到 131 或两个数组的末尾时立即停止。

编辑 (2) 以下代码可以使用 4 个内核在大约 47 秒内完成约 1760 亿次比较。使用 4 个线程将代码设为多线程仅提高了 70%。

仅当 int 值的范围相当小时,使用 BitSet 才有效。否则你必须比较 int[] (如果你需要的话我已经留下了代码)

在 47.712 秒内进行了 176,467,034,428 次比较,找到了 444,888 个匹配项

public static void main(String... args) throws InterruptedException {
    int length = 100;
    int[][] ints = generateArrays(50000, length);
    final BitSet[] bitSets = new BitSet[ints.length];
    for(int i=0;i<ints.length;i++) {
        int[] ia = ints[i];
        BitSet bs = new BitSet(ia[ia.length-1]);
        for (int i1 : ia)
            bs.set(i1);
        bitSets[i] = bs;
    }

    final AtomicInteger matches = new AtomicInteger();
    final AtomicLong comparisons = new AtomicLong();
    int nThreads = Runtime.getRuntime().availableProcessors();
    ExecutorService executorService = Executors.newFixedThreadPool(nThreads);

    long start = System.nanoTime();
    for (int i = 0; i < bitSets.length - 1; i++) {
        final int finalI = i;
        executorService.submit(new Runnable() {
            public void run() {
                for (int j = finalI + 1; j < bitSets.length; j++) {
                    int compare = compare(bitSets[finalI], bitSets[j]);
                    if (compare <= 130)
                        matches.incrementAndGet();
                    comparisons.addAndGet(compare);
                }
            }
        });
    }
    executorService.shutdown();
    executorService.awaitTermination(1, TimeUnit.HOURS);
    long time = System.nanoTime() - start;
    System.out.printf("Peformed %,d comparisons in %.3f seconds and found %,d matches %n",comparisons.longValue(),time/1e9, matches.intValue());
}

private static int[][] generateArrays(int count, int length) {
    List<Integer> rawValues = new ArrayList<Integer>(170);
    for (int i = 0; i < 170; i++)
        rawValues.add(i);

    int[][] ints = new int[count][length];
    Random rand = new Random(1);
    for (int[] ia : ints) {
        Collections.shuffle(rawValues, rand);
        for (int i = 0; i < ia.length; i++)
            ia[i] = (int) (int) rawValues.get(i);
        Arrays.sort(ia);
    }
    return ints;
}

private static int compare(int[] ia, int[] ja) {
    int count = 0;
    int i=0,j=0;
    while(i<ia.length && j<ja.length) {
        int iv = ia[i];
        int jv = ja[j];
        if (iv < jv) {
            i++;
        } else if (iv > jv) {
            j++;
        } else {
            count++; // duplicate
            i++;
            j++;
        }
    }
    return ia.length + ja.length - count;
}
private static int compare(BitSet ia, BitSet ja) {
    BitSet both = new BitSet(Math.max(ia.length(), ja.length()));
    both.or(ia);
    both.or(ja);
    return both.cardinality();
}

You could try a merge sort ignoring duplicates. This is an O(n) operation for sorted arrays. If the two arrays have 70% elements in common the resulting collection will have 130 or less unique ints. In your case you don't need the result so you can just count the number of unique entries and stop as soon as you reach 131 or the end of both arrays.

EDIT (2) The following code can do ~176 billion comparisions in about 47 seconds using 4 cores. Making the code multi-threaded with 4 cours was only 70% faster.

Using BitSet only works if the range of int values is fairly small. Otherwise you have to compare the int[] (I have left the code in should you need it)

Peformed 176,467,034,428 comparisons in 47.712 seconds and found 444,888 matches

public static void main(String... args) throws InterruptedException {
    int length = 100;
    int[][] ints = generateArrays(50000, length);
    final BitSet[] bitSets = new BitSet[ints.length];
    for(int i=0;i<ints.length;i++) {
        int[] ia = ints[i];
        BitSet bs = new BitSet(ia[ia.length-1]);
        for (int i1 : ia)
            bs.set(i1);
        bitSets[i] = bs;
    }

    final AtomicInteger matches = new AtomicInteger();
    final AtomicLong comparisons = new AtomicLong();
    int nThreads = Runtime.getRuntime().availableProcessors();
    ExecutorService executorService = Executors.newFixedThreadPool(nThreads);

    long start = System.nanoTime();
    for (int i = 0; i < bitSets.length - 1; i++) {
        final int finalI = i;
        executorService.submit(new Runnable() {
            public void run() {
                for (int j = finalI + 1; j < bitSets.length; j++) {
                    int compare = compare(bitSets[finalI], bitSets[j]);
                    if (compare <= 130)
                        matches.incrementAndGet();
                    comparisons.addAndGet(compare);
                }
            }
        });
    }
    executorService.shutdown();
    executorService.awaitTermination(1, TimeUnit.HOURS);
    long time = System.nanoTime() - start;
    System.out.printf("Peformed %,d comparisons in %.3f seconds and found %,d matches %n",comparisons.longValue(),time/1e9, matches.intValue());
}

private static int[][] generateArrays(int count, int length) {
    List<Integer> rawValues = new ArrayList<Integer>(170);
    for (int i = 0; i < 170; i++)
        rawValues.add(i);

    int[][] ints = new int[count][length];
    Random rand = new Random(1);
    for (int[] ia : ints) {
        Collections.shuffle(rawValues, rand);
        for (int i = 0; i < ia.length; i++)
            ia[i] = (int) (int) rawValues.get(i);
        Arrays.sort(ia);
    }
    return ints;
}

private static int compare(int[] ia, int[] ja) {
    int count = 0;
    int i=0,j=0;
    while(i<ia.length && j<ja.length) {
        int iv = ia[i];
        int jv = ja[j];
        if (iv < jv) {
            i++;
        } else if (iv > jv) {
            j++;
        } else {
            count++; // duplicate
            i++;
            j++;
        }
    }
    return ia.length + ja.length - count;
}
private static int compare(BitSet ia, BitSet ja) {
    BitSet both = new BitSet(Math.max(ia.length(), ja.length()));
    both.or(ia);
    both.or(ja);
    return both.cardinality();
}
薄暮涼年 2024-10-20 22:29:30

您可以进行两种快速优化。

如果数组 A 的起始元素大于 B 的结束元素,则它们通常不能有公共元素。

另一个是类似三角不等式的东西:

f(B,C) <= 100 - |f(A,B)-f(A,C)|

其原因是(假设f(A,B) > f(A,C))至少有f( A,B) - f(A,C) 元素同时存在于 A 和 B 但不在 C 中。这意味着它们不能是 B 和 C 的公共元素。

There are two quick optimisations you can make.

If array A's start element is greater than B's end element, they trivially can't have common elements.

They other one is a triangle inequality-like thing:

f(B,C) <= 100 - |f(A,B)-f(A,C)|

The reason for this is that (assuming f(A,B) > f(A,C)) there are at least f(A,B) - f(A,C) elements that are in both A and B but not in C. Which means that they can't be common elements of B and C.

今天小雨转甜 2024-10-20 22:29:30

像这样的东西应该可以完成这项工作(假设数组已排序并包含唯一元素):

public static boolean atLeastNMatchingElements(final int n,
    final int[] arr1,
    final int[] arr2){

    /* check assumptions */
    assert (arr1.length == arr2.length);

    final int arrLength = arr2.length;

    { /* optimization */
        final int maxOffset = Math.max(arrLength - n, 0);
        if(arr1[maxOffset] < arr2[0] || arr2[maxOffset] < arr1[0]){
            return false;
        }
    }

    int arr2Offset = 0;
    int matches = 0;

    /* declare variables only once, outside loop */
    int compResult; int candidate;

    for(int i = 0; i < arrLength; i++){
        candidate = arr1[i];
        while(arr2Offset < arrLength - 1){
            compResult = arr2[arr2Offset] - candidate;
            if(compResult > 0){
                break;
            } else{
                arr2Offset++;
                if(compResult == 0){
                    matches++;
                    break;
                }
            }
        }
        if(matches == n){
            return true;
        }
        /* optimization */
        else if(matches < n - (arrLength - arr2Offset)){
            return false;
        }
    }
    return false;
}

示例用法:

public static void main(final String[] args){
    final int[] arr1 = new int[100];
    final int[] arr2 = new int[100];
    int x = 0, y = 0;
    for(int i = 0; i < 100; i++){
        if(i % 10 == 0){
            x++;
        }
        if(i % 12 == 0){
            y++;
        }
        arr1[i] = x;
        arr2[i] = y;
        x++;
        y++;
    }
    System.out.println(atLeastNMatchingElements(70, arr1, arr2));
    System.out.println(atLeastNMatchingElements(95, arr1, arr2));
}

输出:

真实
错误

Premature Optimizations™

我现在尝试优化上述代码。请检查标记为的代码块是否

/* optimization */

有明显差异。优化后,我会重构代码,将其缩减为一两个返回语句。

Something like this should do the job (provided that the arrays are sorted and contain unique elements):

public static boolean atLeastNMatchingElements(final int n,
    final int[] arr1,
    final int[] arr2){

    /* check assumptions */
    assert (arr1.length == arr2.length);

    final int arrLength = arr2.length;

    { /* optimization */
        final int maxOffset = Math.max(arrLength - n, 0);
        if(arr1[maxOffset] < arr2[0] || arr2[maxOffset] < arr1[0]){
            return false;
        }
    }

    int arr2Offset = 0;
    int matches = 0;

    /* declare variables only once, outside loop */
    int compResult; int candidate;

    for(int i = 0; i < arrLength; i++){
        candidate = arr1[i];
        while(arr2Offset < arrLength - 1){
            compResult = arr2[arr2Offset] - candidate;
            if(compResult > 0){
                break;
            } else{
                arr2Offset++;
                if(compResult == 0){
                    matches++;
                    break;
                }
            }
        }
        if(matches == n){
            return true;
        }
        /* optimization */
        else if(matches < n - (arrLength - arr2Offset)){
            return false;
        }
    }
    return false;
}

Sample usage:

public static void main(final String[] args){
    final int[] arr1 = new int[100];
    final int[] arr2 = new int[100];
    int x = 0, y = 0;
    for(int i = 0; i < 100; i++){
        if(i % 10 == 0){
            x++;
        }
        if(i % 12 == 0){
            y++;
        }
        arr1[i] = x;
        arr2[i] = y;
        x++;
        y++;
    }
    System.out.println(atLeastNMatchingElements(70, arr1, arr2));
    System.out.println(atLeastNMatchingElements(95, arr1, arr2));
}

Output:

true
false

Premature Optimizations™

I have now tried to optimize the above code. Please check whether the code blocks marked as

/* optimization */

make a noticeable difference. After optimization, I would refactor the code to get it down to one or two return statements.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文