Removing duplicates from a big integer array using Java

Posted on 2024-09-18 09:31:58

Do you know of a time-efficient way to remove duplicate values from a very large integer array in Java? The size of the array depends on the logged-in user, but it will always exceed 1,500,000 unsorted values with some duplicates. Every integer is between 100000 and 9999999.

I tried converting it to a List, but the heap on my server doesn't allow that much data (my ISP has restricted it). And a regular for loop nested inside another for loop takes over 5 minutes to run.

The size of the array without the duplicates is the one I will store in my database.

Help would be appreciated!

Comments (9)

忆离笙 2024-09-25 09:31:58

You could perhaps use a bit set? I don't know how efficient Java's BitSet is, but 9999999 possible values would only take about 9999999 / 8 = 1,250,000 bytes, or roughly 1.2 MB. As you walk the array of values, set the corresponding bit to true. Then walk over the bit set and output the corresponding value whenever you find a bit set to true.

About 1.2 MB will fit in the cache of many CPUs, so this could be quite efficient depending on the bit set implementation.

This also has the side effect of sorting the data.

And... this is an O(n) algorithm: it requires a single pass over the input data, the set operations are O(1) (for an array-based set like this), and the output pass scans a bit set of fixed size and emits m values, where m is the number of unique values and, by definition, must be <= n.
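
The bit set idea above could be sketched in Java roughly as follows. This is only a minimal illustration, assuming every value lies between 0 and 9999999 as stated in the question; the class name BitSetDedup and the method name removeDuplicates are made up for this example.

```java
import java.util.Arrays;
import java.util.BitSet;

class BitSetDedup {
    // Largest possible value, taken from the question (assumption: no input exceeds it).
    static final int MAX_VALUE = 9_999_999;

    /** Returns the distinct values of input in ascending order. */
    static int[] removeDuplicates(int[] input) {
        BitSet seen = new BitSet(MAX_VALUE + 1);
        for (int value : input) {
            seen.set(value);               // setting an already-set bit is a no-op
        }
        int[] unique = new int[seen.cardinality()];
        int index = 0;
        // nextSetBit returns -1 once there are no more set bits
        for (int v = seen.nextSetBit(0); v >= 0; v = seen.nextSetBit(v + 1)) {
            unique[index++] = v;
        }
        return unique;
    }

    public static void main(String[] args) {
        int[] sample = {500000, 100000, 9999999, 500000, 100000};
        System.out.println(Arrays.toString(removeDuplicates(sample)));
        // prints [100000, 500000, 9999999]
    }
}
```

The result comes out sorted because the bit set is walked from the lowest index upward. The memory cost is fixed by the value range, not by the number of input values.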

酷炫老祖宗 2024-09-25 09:31:58

I would make a hash set that stores every value already contained in the list, before I start adding items to the list. Then, for each value you want to add, just check that the hash set doesn't already contain it.
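
As a rough sketch of that check, assuming the input is an int[] and the output should keep the order of first appearance, something like the following might work. The names HashSetDedup and removeDuplicates are hypothetical. Note that a HashSet<Integer> boxes every value, so it uses noticeably more memory than the raw int[], which may matter on a restricted heap.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class HashSetDedup {
    /** Returns the distinct values of input in order of first appearance. */
    static int[] removeDuplicates(int[] input) {
        Set<Integer> seen = new HashSet<>();
        int[] unique = new int[input.length];   // worst case: no duplicates at all
        int count = 0;
        for (int value : input) {
            // add() returns false when the value is already in the set
            if (seen.add(value)) {
                unique[count++] = value;
            }
        }
        return Arrays.copyOf(unique, count);    // trim to the number of distinct values
    }

    public static void main(String[] args) {
        int[] sample = {500000, 100000, 500000, 9999999, 100000};
        System.out.println(Arrays.toString(removeDuplicates(sample)));
        // prints [500000, 100000, 9999999]
    }
}
```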

安人多梦 2024-09-25 09:31:58
Set<Integer> set = new HashSet<Integer>();
Collections.addAll(set, array);

You will just need an Integer[] array instead of an int[].

聽兲甴掵 2024-09-25 09:31:58

You can try sorting the array first:

int[] arr = yourArray;  // the input array
Arrays.sort(arr);       // requires java.util.Arrays; sorts in place
// then iterate over arr and skip every value equal to its predecessor
拿命拼未来 2024-09-25 09:31:58

The truly desperate could write the array to disk (one value per line), fork off sort infile.txt | uniq | wc -l, and capture the output. This would be needed if memory was still too tight or the domain space of the integers got larger. I don't like this (is he even running Unix?), but my point is that there are many ways to accomplish the task.

Another observation is that the minimum value is 100,000. So we could subtract 100,000 from every value, shrinking the domain space to 9,900,000 possible values and thus saving some memory. Perhaps 100,000 / 8 = 12,500 bytes is peanuts in the scheme of things, but it's essentially free to do.
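
To illustrate the offset idea, here is a small sketch that only counts the distinct values, much like the wc -l pipeline does. It assumes every value is within 100000 to 9999999 as the question states; a value outside that range would make BitSet.set throw an IndexOutOfBoundsException or silently grow the set. The class and method names are invented for this example.

```java
import java.util.BitSet;

class OffsetBitSetCount {
    // Bounds taken from the question (assumption: the input never violates them).
    static final int MIN_VALUE = 100_000;
    static final int MAX_VALUE = 9_999_999;

    /** Returns how many distinct values input contains. */
    static int countDistinct(int[] input) {
        // Only MAX_VALUE - MIN_VALUE + 1 = 9,900,000 bits are needed after the shift.
        BitSet seen = new BitSet(MAX_VALUE - MIN_VALUE + 1);
        for (int value : input) {
            seen.set(value - MIN_VALUE);   // shift so the smallest value maps to bit 0
        }
        return seen.cardinality();         // number of bits set = number of distinct values
    }

    public static void main(String[] args) {
        int[] sample = {100000, 9999999, 100000, 123456};
        System.out.println(countDistinct(sample)); // prints 3
    }
}
```

To recover the values themselves, add MIN_VALUE back to each set bit index.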

尐偏执 2024-09-25 09:31:58
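int[] a = yourArray;  // the input array, assumed non-empty
Arrays.sort(a);       // requires java.util.Arrays
int j = 0;            // index of the last unique element kept so far
for (int i = 1; i < a.length; ++i) {
  if (a[i] != a[j]) {
    ++j;
    a[j] = a[i];
  }
}
// the unique elements are now a[0] through a[j] inclusive, j + 1 in total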
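
Wrapped into a complete method, the sort-and-compact loop might look like the sketch below. The names SortDedup and removeDuplicates are hypothetical, and the method sorts the caller's array in place to avoid a second full-size copy, which seems appropriate given the heap limit mentioned in the question.

```java
import java.util.Arrays;

class SortDedup {
    /**
     * Sorts input in place, compacts the distinct values to the front,
     * and returns them as a new, correctly sized array.
     */
    static int[] removeDuplicates(int[] input) {
        if (input.length == 0) {
            return new int[0];             // nothing to compact
        }
        Arrays.sort(input);                // O(n log n); equal values become adjacent
        int j = 0;                         // index of the last unique element kept
        for (int i = 1; i < input.length; ++i) {
            if (input[i] != input[j]) {
                input[++j] = input[i];
            }
        }
        return Arrays.copyOf(input, j + 1); // elements 0..j inclusive are the unique ones
    }

    public static void main(String[] args) {
        int[] sample = {5, 3, 5, 1, 3, 3};
        System.out.println(Arrays.toString(removeDuplicates(sample)));
        // prints [1, 3, 5]
    }
}
```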
止于盛夏 2024-09-25 09:31:58

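Perhaps you could make a handful of passes over the data? For example, you could make ten passes and, in each pass, apply one of the set suggestions above to a smaller subset of the data (say, the values for which value mod 10 == pass#). Thus:

final int PASSES = 10;
for (int pass = 0; pass < PASSES; pass++) {
  Set<Integer> set = new HashSet<Integer>();  // requires java.util.Set and java.util.HashSet
  for (int entry : data) {                    // data is the input int[]
    if (entry % PASSES == pass) {             // each value matches exactly one pass
      set.add(entry);
    }
  }
  // output the contents of set here, before the next pass replaces it
}

This way you trade time for memory (more passes means less memory but more time, and vice versa).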

把人绕傻吧 2024-09-25 09:31:58

Maybe a hash set that works with primitives instead of objects will do the job? There are free implementations (I haven't used them before, but they may work):

http://trove4j.sourceforge.net/

http://trove4j.sourceforge.net/javadocs/gnu/trove/TIntHashSet.html

Would then look like:

int[] newArray = new TIntHashSet(yourArray).toArray();
岁月染过的梦 2024-09-25 09:31:58

If you are sure that the integers have reasonably small values (e.g. never negative and always less than 1000 or 10000), you can try a trick like this:

    final int MAX = 100; 
    int[] arrayWithRepeats = {99, 0, 10, 99, 0, 11, 99};

    //we are counting here integers with the same value
    int [] arrayOfValues = new int[MAX+1];
    int countOfUniqueIntegers = 0;
    for(int i : arrayWithRepeats) {
        if(arrayOfValues[i] == 0) {
            countOfUniqueIntegers++;
        }
        arrayOfValues[i]++;
    }

    // you can use arrayOfValues (smaller) or convert it
    // to table of unique values (more usable)

    int[] arrayOfUniqueValues = new int[countOfUniqueIntegers];
    int index = 0;
    for(int i = 0; i<arrayOfValues.length; i++) {
        if(arrayOfValues[i] != 0) {
            arrayOfUniqueValues[index] = i;
            index++;
        }
    }

    //and now arrayOfUniqueValues is even sorted
    System.out.println( Arrays.toString(arrayOfUniqueValues) );

Output: [0, 10, 11, 99]
