Reading streaming data into a sorted list
We know that, in general, the "smarter" comparison sorts on arbitrary data run in worst case complexity O(N * log(N)).
My question is what happens if we are asked not to sort a collection, but a stream of data. That is, values are given to us one by one with no indicator of what comes next (other than that the data is valid/in range). Intuitively, one might think that it is superior then to sort data as it comes in (like picking up a poker hand one by one) rather than gathering all of it and sorting later (sorting a poker hand after it's dealt). Is this actually the case?
Gathering and sorting would be O(N + N * log(N)) = O(N * log(N)). However if we sort it as it comes in, it is O(N * K), where K = time to find the proper index + time to insert the element. This complicates things, since the value of K now depends on our choice of data structure. An array is superior in finding the index but wastes time inserting the element. A linked list can insert more easily but cannot binary search to find the index.
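To make the array trade-off concrete, here is a minimal sketch in Python, assuming the stream is any iterable of comparable values (the function name `sorted_from_stream` is made up for illustration). The standard-library `bisect.insort` finds the index by binary search in O(log N), but the underlying list insert still shifts elements in O(N).

```python
import bisect


def sorted_from_stream(stream):
    """Keep a list sorted while values arrive one at a time.

    Hypothetical helper for illustration: each insort call does an
    O(log N) binary search followed by an O(N) list insertion.
    """
    result = []
    for value in stream:
        # Finding the index is cheap; shifting the tail of the list is not.
        bisect.insort(result, value)
    return result


if __name__ == "__main__":
    print(sorted_from_stream(iter([5, 2, 9, 1, 5])))
```

The list is sorted after every arrival, which is the property the "sort as it comes in" approach buys; the total cost over N arrivals is still O(N^2) in the worst case because of the shifting.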
Is there a complete discussion on this issue? When should we use one method or another? Might there be a desirable in-between strategy of sorting every once in a while?
Comments (4)
Balanced tree sort has O(N log N) complexity and maintains the list in sorted order while elements are added.
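The answer above does not name a particular tree, so as an illustration only, here is a sketch using a treap (a randomized search tree whose expected depth is O(log N)). It is swapped in because it is short to write; a red-black or AVL tree would serve the same purpose. All names here (`Node`, `insert`, `in_order`) are hypothetical.

```python
import random


class Node:
    """Treap node: ordered by key like a BST, by prio like a min-heap."""

    def __init__(self, key):
        self.key = key
        self.prio = random.random()  # random priority keeps the tree balanced in expectation
        self.left = None
        self.right = None


def rotate_right(node):
    left = node.left
    node.left = left.right
    left.right = node
    return left


def rotate_left(node):
    right = node.right
    node.right = right.left
    right.left = node
    return right


def insert(node, key):
    """Insert key and return the new subtree root (expected O(log N))."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
        if node.left.prio < node.prio:  # restore the heap property on priorities
            node = rotate_right(node)
    else:
        node.right = insert(node.right, key)
        if node.right.prio < node.prio:
            node = rotate_left(node)
    return node


def in_order(node):
    """Yield keys in sorted order (iterative in-order traversal)."""
    stack = []
    while stack or node is not None:
        while node is not None:
            stack.append(node)
            node = node.left
        node = stack.pop()
        yield node.key
        node = node.right


if __name__ == "__main__":
    root = None
    for value in [5, 3, 8, 1, 9, 2, 7]:  # values arriving one by one
        root = insert(root, value)
    print(list(in_order(root)))
```

Each arrival costs expected O(log N), so N arrivals cost expected O(N log N) in total, and the sorted sequence can be read off at any point with an O(N) traversal.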
Absolutely not!
First, if I could sort streamed data faster, I could just accept all my data in O(N), then stream it to myself and sort it with that quicker method. In other words, there is a reduction from the all-data problem to the stream problem, so sorting a stream cannot be asymptotically faster than sorting a collection.
Second, you're describing an insertion sort, which actually runs in O(N^2) time. Your description of O(NK) was right, but K is not constant; it is a function of N, since it might take O(N) time to find the appropriate index. You could improve it to a binary insertion sort, which finds each index in O(log N) comparisons, but an array still needs O(N) time to shift elements on every insert, so the total stays O(N^2), and a linked list cannot be binary searched in the first place. A structure that supports both steps in O(log N), such as a balanced search tree, brings the total to O(N log N), so you haven't really saved anything.
It's probably also worth mentioning the general principle: as long as you're in the comparison model (i.e. you don't have any non-trivial and helpful information about the data you're sorting, which is the general case), no sorting algorithm can do better than O(N log N) in the worst case. That is, the worst-case running time of any sorting algorithm in this model is Ω(N log N). That's not a hypothesis but a theorem, so it is impossible to find anything faster under the same assumptions.
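The point about arrays staying quadratic even with binary search can be checked by counting operations. This is a rough sketch, with a made-up function name `binary_insertion_counts`, that tallies comparisons and element shifts separately for a binary insertion sort on a Python list.

```python
def binary_insertion_counts(values):
    """Binary insertion sort that reports (sorted_list, comparisons, shifts).

    Hypothetical helper for illustration; 'shifts' counts how many existing
    elements must move to make room for each new one.
    """
    out = []
    comparisons = 0
    shifts = 0
    for value in values:
        lo, hi = 0, len(out)
        while lo < hi:  # binary search for the insertion index
            mid = (lo + hi) // 2
            comparisons += 1
            if out[mid] <= value:
                lo = mid + 1
            else:
                hi = mid
        shifts += len(out) - lo  # everything after lo moves one slot right
        out.insert(lo, value)
    return out, comparisons, shifts


if __name__ == "__main__":
    n = 1000
    _, comparisons, shifts = binary_insertion_counts(range(n, 0, -1))
    print("comparisons:", comparisons, "shifts:", shifts)
```

On descending input every new value goes to the front, so the shifts add up to N(N-1)/2 while the comparisons stay near N log2(N). The search is no longer the bottleneck; the insertion is.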
OK, if the stream is relatively slow, you will have a completely sorted list (minus the last element) by the time your last element arrives. All that remains is a single binary search, O(log n), rather than a complete sort, O(n log n). Potentially there is a perceived performance gain, since you are getting a head start on the other sort algorithms.
Managing, queuing, and extracting data from a stream is a completely different issue and might be counter-productive to your intentions. I would not recommend this unless you can sort the complete data set in about the same time it takes to stream one or maybe two elements (and you feel good about coding the streaming portion).
Use heap sort in those cases where tree sort will behave badly, i.e. on large data sets, since tree sort needs additional space to store the tree structure.
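As a sketch of the heap-based approach applied to a stream, here is a version using Python's standard `heapq` module, which stores the heap in a plain list with no per-node pointers. The function name `heap_sorted` is made up for illustration. Note that a heap only keeps the smallest element at the front; the full sorted order appears only as elements are popped.

```python
import heapq


def heap_sorted(stream):
    """Push each arriving value onto a binary heap, then pop them in order.

    Hypothetical helper for illustration: each push and each pop is
    O(log N), so the whole run is O(N log N).
    """
    heap = []
    for value in stream:
        heapq.heappush(heap, value)  # O(log N) per arrival
    # Popping repeatedly yields the values in ascending order.
    return [heapq.heappop(heap) for _ in range(len(heap))]


if __name__ == "__main__":
    print(heap_sorted(iter([4, 1, 3, 1, 5])))
```

Unlike a search tree, the heap cannot be traversed in sorted order midway through the stream without popping, so it suits the case where the sorted result is only needed once all the data has arrived.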