How do I get the largest numbers out of a huge list of numbers?
I'd like to get the largest 100 elements out of a list of at least 100000000 numbers.
I could sort the entire list and just take the last 100 elements from the sorted list, but that would be very expensive in terms of both memory and time.
Is there an existing easy, Pythonic way of doing this?
What I want is a function like the following instead of a pure sort. I don't want to waste time sorting the elements I don't care about.
For example, this is the function I'd like to have:
getSortedElements(100, lambda x,y:cmp(x,y))
Note that this requirement is purely about performance.
Comments (6)
The heapq module in the standard library offers the nlargest() function to do this:
It won't sort the entire list, so you won't waste time on the elements you don't need.
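A minimal sketch of that call, assuming the numbers are available as any Python iterable (the helper name `largest_n` and the random test data are made up for illustration):

```python
import heapq
import random

def largest_n(numbers, n=100):
    """Return the n largest values, ordered from largest to smallest."""
    # heapq.nlargest accepts any iterable, so the input can also be a
    # generator that streams values instead of a fully built list.
    return heapq.nlargest(n, numbers)

random.seed(0)
data = [random.random() for _ in range(100_000)]
top = largest_n(data, 100)
```

`heapq.nlargest` also takes an optional `key` function, which covers the custom-comparison case hinted at in the question.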
You can use a heap data structure. A heap is not necessarily fully ordered, but it is a fairly fast way to keep semi-ordered data, and in a min-heap the smallest item is always the first element.
A heap has two basic operations that will help you: Add and Replace.
Basically, you add items to it until it holds 100 items (the top N from your question). After that, for every new item, you replace the first item with the new one whenever the new item is bigger than the first item.
Whenever you replace the first item with something bigger, the heap's internal code readjusts its contents: if the new item is not the smallest, it moves deeper into the heap, and the new smallest item moves up to the first position, ready to be replaced in turn.
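The add-then-replace procedure can be sketched with Python's `heapq` module, which implements a min-heap on top of a plain list. The function name `top_n_with_heap` is hypothetical; `heappush` plays the role of Add and `heapreplace` the role of Replace:

```python
import heapq

def top_n_with_heap(numbers, n=100):
    """Keep the n largest values seen so far in a min-heap of size n."""
    if n <= 0:
        return []
    heap = []
    for value in numbers:
        if len(heap) < n:
            # Fill phase: add items until the heap holds n of them.
            heapq.heappush(heap, value)
        elif value > heap[0]:
            # heap[0] is the smallest kept value; swap it for the bigger one.
            heapq.heapreplace(heap, value)
    # The heap itself is only semi-ordered, so sort the n survivors at the end.
    return sorted(heap, reverse=True)
```

Most values in a long input fail the `value > heap[0]` test, so they cost a single comparison each; only the replacements pay the O(log n) heap adjustment.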
Selection algorithms should help here.
A very easy solution is to find the 100th biggest element, then run through the list picking off the elements that are at least as big as it (if several elements tie at that value, keep only as many as needed to reach 100). That will give you the 100 biggest elements. This is linear in the length of the list, provided the 100th biggest element is found with a linear-time selection algorithm; that is the best possible, since every element has to be examined at least once.
There are more sophisticated algorithms. A heap, for example, is very amenable to this problem. The heap-based algorithm is O(n log k), where n is the length of the list and k is the number of largest elements that you want to select. There's a discussion of this problem on the Wikipedia page for selection algorithms.
Edit: Another poster has pointed out that Python has a built-in solution to this problem. Obviously that is far easier than rolling your own, but I'll keep this post up in case you would like to learn how such algorithms work.
For the algorithms weenies in the audience: you can do this with a simple variation on Tony Hoare's algorithm Find:
This algorithm puts the largest topn elements into the first topn elements of array a, without sorting them. Of course, if you want them sorted, or for sheer simplicity, a heap is better, and calling the library function is better still. But it's a cool algorithm.
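A Find-style (quickselect) partition that leaves the largest topn elements in the first topn slots of a might look roughly like the sketch below. This is an illustration rather than the poster's own code: the function name `partition_top` and the random pivot choice are assumptions made here. It runs in linear time on average, with a quadratic worst case.

```python
import random

def partition_top(a, topn):
    """Rearrange list a in place so a[:topn] holds its topn largest values, unsorted."""
    if topn <= 0 or topn >= len(a):
        return  # nothing to do: either no slots, or every element qualifies
    k = topn - 1            # index where the topn-th largest value must end up
    lo, hi = 0, len(a) - 1
    while lo < hi:
        pivot = a[random.randint(lo, hi)]
        i, j = lo, hi
        # Hoare-style partition in descending order.
        while i <= j:
            while a[i] > pivot:
                i += 1
            while a[j] < pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        # Now a[lo..j] >= pivot >= a[i..hi], and j < i.
        if k <= j:
            hi = j          # the boundary lies in the left part
        elif k >= i:
            lo = i          # the boundary lies in the right part
        else:
            return          # a[k] equals the pivot and is already in place
```

Only the side of the partition that contains position `topn - 1` is processed further, which is what keeps the expected cost linear instead of the O(n log n) of a full sort.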
Here is a solution I have used that is independent of libraries and that
will work in any programming language that has arrays:
Initialisation:
For each value, say current_value, in the input list:
minvalue will quickly reach a high value, so most values in the input list only need to be compared against minvalue (and the result of that comparison will mostly be false).
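One way such a library-independent approach could look is sketched below, assuming a fixed-capacity list of kept values plus a cached minimum. The names `top_n_with_array` and `top` are invented for this sketch, and the poster's actual initialisation and loop steps may have differed; only `current_value` and `minvalue` come from the text.

```python
def top_n_with_array(numbers, n=100):
    """Track the n largest values in a plain list, caching the smallest kept value."""
    if n <= 0:
        return []
    # Initialisation (assumed): an empty result list and no minimum yet.
    top = []
    minvalue = None
    for current_value in numbers:
        if len(top) < n:
            # Fill phase: keep everything until n values are stored.
            top.append(current_value)
            if len(top) == n:
                minvalue = min(top)
        elif current_value > minvalue:
            # Rare case: overwrite the smallest kept value, then rescan for
            # the new minimum (an O(n) step over just the kept values).
            top[top.index(minvalue)] = current_value
            minvalue = min(top)
    return top  # unsorted
```

Each replacement costs O(n) here instead of the O(log n) of a heap, but for n = 100 and a very long input the single `current_value > minvalue` comparison dominates.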
The best way to do this is to maintain a heap-ordered priority queue capped at 100 entries: once it holds 100 entries, pop the smallest one each time you add a new one.
While you don't care whether the results are sorted, you get the ordering almost for free. In order to know you have the top 100, you need to keep your current top numbers ordered in some efficient data structure. That structure naturally tracks the minimum, the maximum, and the relative position of each element, so you can determine its position relative to its neighbors.
As has been mentioned, in Python you would use heapq. In Java, use PriorityQueue:
http://java.sun.com/javase/6/docs/api/java/util/PriorityQueue.html