Count the unique elements in a huge array
One of my colleagues was asked this question in an interview.
Given a huge array that stores unsigned ints. The length of the array is 100000000. Find an efficient way to count the unique elements present in the array.
E.g. arr = {2,34,5,6,7,2,2,5,1,34,5}
O/p: Count of 2 is 3, Count of 34 is 2 and so on.
What are efficient algorithms to do this? My first thought was that a dictionary/hash table would be one of the options, but since the array is very large, that seems inefficient. Is there any way to do this?
Comments (7)
Heap sort is O(nlogn) and in-place. In-place is necessary when dealing with large data sets. Once sorted you can make one pass through the array tallying occurrences of each value. Because the array is sorted, once a value changes you know you've seen all occurrences of the previous value.
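A minimal C++ sketch of this approach, assuming a small sample array stands in for the huge input; `std::make_heap`/`std::sort_heap` perform the in-place heapsort, then one pass tallies each run of equal values:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // Sample data standing in for the huge array.
    std::vector<unsigned int> arr = {2, 34, 5, 6, 7, 2, 2, 5, 1, 34, 5};

    // In-place heapsort: build a max-heap, then repeatedly pop to the back.
    std::make_heap(arr.begin(), arr.end());
    std::sort_heap(arr.begin(), arr.end());  // O(n log n), no extra storage

    // Single pass over the sorted data: a run of equal values ends when the
    // value changes, so each count is reported exactly once.
    std::size_t i = 0;
    while (i < arr.size()) {
        std::size_t j = i;
        while (j < arr.size() && arr[j] == arr[i]) ++j;
        std::printf("Count of %u is %zu\n", arr[i], j - i);
        i = j;
    }
    return 0;
}
```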
Many other posters have suggested sorting the data and then finding the number of adjacent values, but no one has mentioned using radix sort yet to get the runtime to be O(n lg U) (where U is the maximum value in the array) instead of O(n lg n). Since lg U = O(lg n), assuming that integers take up one machine word, this approach is asymptotically faster than heapsort.
Non-comparison sorts are always fun in interviews. :-)
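As a sketch of the non-comparison route, here is a minimal LSD radix sort over 32-bit unsigned values, one byte per pass; the function name and the base-256 digit size are illustrative choices, not from the answer. The counting phase is the same as above once the array is sorted.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Minimal LSD radix sort on 32-bit unsigned values, one byte per pass.
// Four O(n) passes give O(n) total for fixed-width keys, which is the
// O(n lg U) bound with the digit count treated as a constant.
void radix_sort(std::vector<unsigned int>& a) {
    std::vector<unsigned int> tmp(a.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::array<std::size_t, 257> count{};               // bucket offsets
        for (unsigned int v : a) ++count[((v >> shift) & 0xFFu) + 1];
        for (std::size_t b = 1; b < count.size(); ++b) count[b] += count[b - 1];
        for (unsigned int v : a) tmp[count[(v >> shift) & 0xFFu]++] = v;
        a.swap(tmp);                                         // output of this pass
    }
}
```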
Sort it, then scan it from the beginning to determine the counts for each item.
This approach requires no additional storage, and can be done in O(n log n) time (for the sort).
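If only the number of distinct values is needed (as in the question title), the same sort-then-scan idea can be expressed with `std::sort` plus `std::unique`; a minimal sketch, with an illustrative function name:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Sorts the array in place, then counts distinct values: after sorting,
// equal values are adjacent, so std::unique leaves one element per value.
std::size_t count_distinct(std::vector<unsigned int>& a) {
    std::sort(a.begin(), a.end());                           // O(n log n)
    return static_cast<std::size_t>(
        std::unique(a.begin(), a.end()) - a.begin());        // one linear scan
}
```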
If the range of the int values is limited, then you may allocate an array, which serves to count the occurrences for each possible value. Then you just iterate through your huge array and increment the counters.
Thus you find the solution in linear time (O(n)), but at the expense of memory consumption. That is, if your ints span the whole range allowed by 32-bit ints, you would need to allocate an array of 4G ints, which is impractical...
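A minimal sketch of the counting-array idea, assuming the values are known to fit in a small range; the `MAX_VALUE` constant and sample data are illustrative:

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // Assumes all values lie in [0, MAX_VALUE]; the constant is illustrative.
    const unsigned int MAX_VALUE = 1000;
    std::vector<unsigned int> arr = {2, 34, 5, 6, 7, 2, 2, 5, 1, 34, 5};

    // One counter per possible value: a single O(n) pass fills the table.
    std::vector<std::size_t> count(MAX_VALUE + 1, 0);
    for (unsigned int v : arr) ++count[v];

    for (unsigned int v = 0; v <= MAX_VALUE; ++v)
        if (count[v] != 0)
            std::printf("Count of %u is %zu\n", v, count[v]);
    return 0;
}
```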
How about using a BloomFilter implementation, like http://code.google.com/p/java-bloomfilter/ ?
First do a bloom.contains(element); if true, continue; if false, do bloom.add(element).
At the end, count the number of elements added. The Bloom filter needs approx. 250mb of memory to store 100000000 elements at 10 bits per element.
The problem is that false positives are possible in Bloom filters and can only be minimized by increasing the number of bits per element. This could be addressed by using two Bloom filters with different hashing that need to agree.
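A toy sketch of the pattern described above, using a hand-rolled bit array rather than the java-bloomfilter API; the struct name, sizes, and hash constants are all illustrative assumptions:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy Bloom filter: test membership first, add only values not seen before,
// and count how many adds happen. False positives make the count a lower bound.
struct Bloom {
    std::vector<bool> bits;
    int k;                                   // hash probes per element
    Bloom(std::size_t m, int k_) : bits(m, false), k(k_) {}

    // Derive k indices from two mixes of the value (double hashing).
    std::size_t idx(unsigned int v, int i) const {
        std::uint64_t h1 = v * 0x9E3779B97F4A7C15ull;
        std::uint64_t h2 = (v ^ 0xFF51AFD7ED558CCDull) * 0xC4CEB9FE1A85EC53ull;
        return (h1 + static_cast<std::uint64_t>(i) * h2) % bits.size();
    }
    bool contains(unsigned int v) const {
        for (int i = 0; i < k; ++i)
            if (!bits[idx(v, i)]) return false;
        return true;
    }
    void add(unsigned int v) {
        for (int i = 0; i < k; ++i) bits[idx(v, i)] = true;
    }
};

// Usage: if (!bloom.contains(x)) { bloom.add(x); ++distinct; }
```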
Hashing in this case is not inefficient. The cost will be approximately O(N): O(N) for iterating over the array and ~O(N) for iterating over the hashtable. Since you need O(N) just to check each element, the complexity is good.
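A minimal sketch of this hashing approach using `std::unordered_map`; the sample data and output format are illustrative:

```cpp
#include <cstddef>
#include <cstdio>
#include <unordered_map>
#include <vector>

int main() {
    std::vector<unsigned int> arr = {2, 34, 5, 6, 7, 2, 2, 5, 1, 34, 5};

    // One O(n) pass builds value -> occurrence count; reserving buckets up
    // front avoids repeated rehashing on a 100M-element input.
    std::unordered_map<unsigned int, std::size_t> counts;
    counts.reserve(arr.size());
    for (unsigned int v : arr) ++counts[v];

    // ~O(n) pass over the table to report each count.
    for (const auto& [value, count] : counts)
        std::printf("Count of %u is %zu\n", value, count);
    return 0;
}
```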
Sorting is a good idea. However, the type of sorting depends on the range of possible values: for a small range, counting sort would be good. When dealing with such a big array, it would be efficient to use multiple cores; radix sort might be a good fit.