Removing duplicates from a large integer array using Java

Posted 2024-09-18 09:31:58, 261 words, 5 views, 0 comments

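Do you know of a time-efficient way to remove duplicate values from a very large integer array using Java? The size of the array depends on the logged-in user, but it will always exceed 1,500,000 unsorted values, with some duplicates. Every integer is a number between 100000 and 9999999.

I tried converting it to a List, but the heap on my server doesn't allow this amount of data (my ISP has restricted it). And a regular for loop nested inside a for loop takes over 5 minutes to run.

The size of the array without the duplicates is what I will store in my database.

Help would be appreciated!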

忆离笙 2024-09-25 09:31:58

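Could you perhaps use a bit set? I don't know how efficient Java's BitSet is. But 9,999,999 possible values would only take about 9,999,999 / 8 = 1,250,000 bytes, which is just over 1 MB. As you walk the array of values, set the corresponding bit to true. Then you can walk over the bit set and output the corresponding value whenever you find a bit set to true.

1 MB can fit in a CPU cache, so this could be quite efficient, depending on the bit set implementation.

This also has the side effect of sorting the data.

And... this is an O(n) algorithm: it requires a single pass over the input data, the set operations are O(1) (for an array-based set like this), and the output pass scans a bit set of fixed size (a constant for this value range) while emitting m values, where m is the number of unique values and, by definition, must be <= n.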
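The bit set idea described in this answer could be sketched with java.util.BitSet roughly as follows. This is only a minimal sketch: the class name BitSetDedup, the method name dedup, and the constant MAX_VALUE are made up for illustration, and the upper bound of 9,999,999 is taken from the question.

```java
import java.util.Arrays;
import java.util.BitSet;

class BitSetDedup {
    // Assumed upper bound on the values, taken from the question.
    static final int MAX_VALUE = 9_999_999;

    /** Returns the distinct values of input in ascending order. */
    static int[] dedup(int[] input) {
        BitSet seen = new BitSet(MAX_VALUE + 1);
        for (int value : input) {
            seen.set(value); // setting an already-set bit changes nothing
        }
        // cardinality() is the number of set bits, i.e. the number of unique values.
        int[] unique = new int[seen.cardinality()];
        int k = 0;
        for (int v = seen.nextSetBit(0); v >= 0; v = seen.nextSetBit(v + 1)) {
            unique[k++] = v;
        }
        return unique;
    }

    public static void main(String[] args) {
        int[] sample = {500_000, 100_000, 9_999_999, 500_000, 100_000};
        System.out.println(Arrays.toString(dedup(sample)));
        // prints [100000, 500000, 9999999]
    }
}
```

If only the count of unique values is needed for the database, seen.cardinality() alone is enough and the second loop can be dropped.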

酷炫老祖宗 2024-09-25 09:31:58

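I would create a hash set that stores all the values contained in the list before I start adding items to the list. Then just check that the hash set doesn't already contain the value you want to add.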

安人多梦 2024-09-25 09:31:58

Set<Integer> set = new HashSet<Integer>();
Collections.addAll(set, array);

You will just need an array of type Integer[] instead of int[].
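If the data is only available as an int[], a plain loop can fill the set without first building an Integer[]. The sketch below assumes exactly that; the class name HashSetDedup and the method name toSet are hypothetical. Note that every value is still boxed into an Integer object, so this uses considerably more heap than the primitive array, which may matter given the memory limit mentioned in the question.

```java
import java.util.HashSet;
import java.util.Set;

class HashSetDedup {
    /** Collects the distinct values of a primitive int[] into a HashSet. */
    static Set<Integer> toSet(int[] input) {
        Set<Integer> set = new HashSet<Integer>();
        for (int value : input) {
            set.add(value); // autoboxing converts each int to an Integer
        }
        return set;
    }

    public static void main(String[] args) {
        Set<Integer> unique = toSet(new int[]{3, 1, 3, 2, 1});
        System.out.println(unique.size()); // prints 3
    }
}
```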

聽兲甴掵 2024-09-25 09:31:58

You can try sorting the array first:

int[] arr = yourarray;
Arrays.sort(arr);
// then iterate over arr and remove duplicates

拿命拼未来 2024-09-25 09:31:58

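The truly desperate could write the array to disk, fork off sort < infile.txt | uniq | wc -l, and capture the output. This would be needed if memory were still too tight or if the domain space of the integers got larger. I don't like this (is he even running Unix?), but my point is that there are many ways to accomplish the task.

Another observation is that the minimum value is 100,000. So we could subtract 100,000 from the maximum value of 9,999,999, reducing the domain space and thus saving some memory. Perhaps 100,000 bits / 8 = 12,500 bytes is chump change in the scheme of things, but it's essentially free.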
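The offset observation might look like this when combined with a bit set. It is a sketch under the assumption that every value really lies between 100,000 and 9,999,999 as the question states; the names OffsetBitSetDedup, MIN_VALUE, and MAX_VALUE are invented for the example.

```java
import java.util.Arrays;
import java.util.BitSet;

class OffsetBitSetDedup {
    // Assumed value range, taken from the question.
    static final int MIN_VALUE = 100_000;
    static final int MAX_VALUE = 9_999_999;

    /** Returns the distinct values in ascending order, using bit index (value - MIN_VALUE). */
    static int[] dedup(int[] input) {
        BitSet seen = new BitSet(MAX_VALUE - MIN_VALUE + 1);
        for (int value : input) {
            // A value below MIN_VALUE would give a negative index and throw
            // IndexOutOfBoundsException, so the range assumption must hold.
            seen.set(value - MIN_VALUE);
        }
        int[] unique = new int[seen.cardinality()];
        int k = 0;
        for (int bit = seen.nextSetBit(0); bit >= 0; bit = seen.nextSetBit(bit + 1)) {
            unique[k++] = bit + MIN_VALUE; // add the offset back
        }
        return unique;
    }

    public static void main(String[] args) {
        int[] sample = {100_000, 9_999_999, 100_000, 250_000};
        System.out.println(Arrays.toString(dedup(sample)));
        // prints [100000, 250000, 9999999]
    }
}
```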

尐偏执 2024-09-25 09:31:58

int[] a = yourarray; // the unsorted input, assumed to be non-empty
Arrays.sort(a);
int j = 0; // index of the last unique element kept so far
for (int i = 1; i < a.length; ++i) {
  if (a[i] != a[j]) {
    ++j;
    a[j] = a[i];
  }
}
// a[0] through a[j] (inclusive) now hold the j + 1 unique values, in ascending order
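To hand the result on as an array of exactly the right size, the sort-and-compact loop can be wrapped in a method that also handles an empty input. This is one possible sketch; the class name SortDedup and the method name dedupInPlace are hypothetical. It sorts the caller's array in place to avoid a second full-size copy, which seems preferable on a constrained heap.

```java
import java.util.Arrays;

class SortDedup {
    /** Sorts a in place, compacts the unique values to the front, and returns them as a trimmed copy. */
    static int[] dedupInPlace(int[] a) {
        if (a.length == 0) {
            return new int[0]; // nothing to compact
        }
        Arrays.sort(a);
        int j = 0; // index of the last unique element kept so far
        for (int i = 1; i < a.length; i++) {
            if (a[i] != a[j]) {
                a[++j] = a[i];
            }
        }
        return Arrays.copyOf(a, j + 1); // j + 1 unique values
    }

    public static void main(String[] args) {
        int[] sample = {7, 3, 7, 1, 3, 3};
        System.out.println(Arrays.toString(dedupInPlace(sample)));
        // prints [1, 3, 7]
    }
}
```

If only the number of unique values is needed, j + 1 can be returned directly and the final copy skipped.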

止于盛夏 2024-09-25 09:31:58

Perhaps you could make a handful of passes over the data? For example, you could make ten passes over the data and, on each pass, apply one of the set suggestions above to a smaller subset of the data (say, the values for which value mod 10 == pass#). Thus:

int passes = 10;
for (int pass = 0; pass < passes; pass++) {
  Set<Integer> set = new HashSet<Integer>();
  for (int entry : data) {  // data is the input int[]
    if (entry % passes == pass) {  // keep only this pass's share of the values
      set.add(entry);
    }
  }
  output(set);  // placeholder for however you store or count each pass's unique values
}

This way you trade time for memory (increase the number of passes for less memory and more time, and vice versa).

把人绕傻吧 2024-09-25 09:31:58

Maybe a hash set that works with primitives instead of objects will do the job? There are free implementations (I haven't used them before, but maybe it works):

http://trove4j.sourceforge.net/

http://trove4j.sourceforge.net/javadocs/gnu/trove/TIntHashSet.html

It would then look like:

int[] newArray = new TIntHashSet(yourArray).toArray();

岁月染过的梦 2024-09-25 09:31:58

If you are sure that the integers have reasonably small values (e.g. never negative and always less than 1000 or 10000), you can try a trick like this:

    final int MAX = 100;
    int[] arrayWithRepeats = {99, 0, 10, 99, 0, 11, 99};

    // count how many times each value occurs; the value itself is the index
    int[] arrayOfValues = new int[MAX + 1];
    int countOfUniqueIntegers = 0;
    for (int i : arrayWithRepeats) {
        if (arrayOfValues[i] == 0) {
            countOfUniqueIntegers++;
        }
        arrayOfValues[i]++;
    }

    // you can use arrayOfValues directly, or convert it
    // to an array of unique values (more usable)

    int[] arrayOfUniqueValues = new int[countOfUniqueIntegers];
    int index = 0;
    for (int i = 0; i < arrayOfValues.length; i++) {
        if (arrayOfValues[i] != 0) {
            arrayOfUniqueValues[index] = i;
            index++;
        }
    }

    // and now arrayOfUniqueValues is even sorted
    System.out.println(Arrays.toString(arrayOfUniqueValues));

Output: [0, 10, 11, 99]
