Pre-sort analysis algorithm?
It's a well-known issue with Quicksort that when the data set is in or almost in sorted order, performance degrades horribly. In this case, Insertion Sort, which is normally very slow, is easily the best choice. The question is knowing when to use which.
Is there an algorithm available to run through a data set, apply a comparison factor, and return a report on how close the data set is to being in sort order? I prefer Delphi/Pascal, but I can read other languages if the example isn't overly complex.
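Such a "how close to sorted" pre-pass can be done in a single linear scan over the data. Here is a minimal sketch in Python rather than Delphi/Pascal (the function name `sortedness` is my own; a Pascal version would be a straightforward loop over adjacent pairs):

```python
def sortedness(data):
    """Return the fraction of adjacent pairs already in order.

    1.0 means fully sorted, roughly 0.5 is typical of random data,
    and values near 0.0 indicate reverse-sorted input.
    One linear pass: O(n) comparisons, no extra memory.
    """
    if len(data) < 2:
        return 1.0
    in_order = sum(1 for a, b in zip(data, data[1:]) if a <= b)
    return in_order / (len(data) - 1)
```

A caller could then pick Insertion Sort above some threshold (say 0.9) and Quicksort otherwise, though as the answers below point out, an adaptive algorithm usually makes this pre-pass unnecessary.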
Answers (8)
As you'd expect, quite a lot of thought goes into this. The median-of-three technique means that quicksort's worst-case behaviour doesn't occur for sorted data, but instead for less obvious cases.
Introsort is quite exciting, since it avoids quicksort's quadratic worst case altogether. Instead of your natural question, "how do I detect that the data is nearly-sorted", it in effect asks itself as it's going along, "is this taking too long?". If the answer is yes, it switches from quicksort to heapsort.
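Introsort's "is this taking too long?" check is just a recursion-depth budget of about 2·log2(n); when a partition exhausts it, that partition is finished with heapsort. A sketch in Python (out-of-place, with a random pivot, purely to show the switching logic; real implementations such as C++ `std::sort` work in place):

```python
import heapq
import math
import random

def introsort(data):
    """Sketch of introsort: quicksort with a depth budget; when the
    budget runs out, fall back to heapsort for that partition."""
    data = list(data)

    def heapsort(items):
        # heapq gives us an easy O(n log n) worst-case fallback
        heapq.heapify(items)
        return [heapq.heappop(items) for _ in range(len(items))]

    def sort(items, depth):
        if len(items) <= 1:
            return items
        if depth == 0:                     # "is this taking too long?"
            return heapsort(items)
        pivot = random.choice(items)
        less    = [x for x in items if x < pivot]
        equal   = [x for x in items if x == pivot]
        greater = [x for x in items if x > pivot]
        return sort(less, depth - 1) + equal + sort(greater, depth - 1)

    depth_limit = 2 * max(1, math.floor(math.log2(max(len(data), 2))))
    return sort(data, depth_limit)
```

The quadratic worst case disappears because no partition can recurse past the budget, so the total work is bounded by O(n log n) either way.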
Timsort combines merge sort with insertion sort, and performs very well on sorted or reverse-sorted data, and on data that includes sorted or reverse-sorted subsets.
So probably the answer to your question is, "you don't need a pre-pass analysis, you need an adaptive sort algorithm".
There's also SmoothSort, which is apparently quite tricky to implement, but it varies between O(N log N) and O(N) depending on how sorted the data is to start with.
http://en.wikipedia.org/wiki/Smoothsort
Long tricky PDF:
http://www.cs.utexas.edu/users/EWD/ewd07xx/EWD796a.PDF
However, if your data is truly huge and you have to access it serially, mergesort is probably the best. It's always O(N log N) and it has excellent 'locality' properties.
I've not heard of any pre-sorting analysis but my opinion is that if you are going to go through the dataset to analyze it then you are already cutting into performance of your overall sorting time.
One possible solution is to take the first, last and middle elements in the current sort range (during the QuickSort operation) and choose the median of the three as the pivot element.
To fully analyze for the purpose of deciding which algorithm to use, you are going to do nearly the work of sorting. You could do something like checking the values at a small percentage of random but increasing indexes (i.e. analyze a small sample of the items).
You would still have to run through all records to determine if they are sorted or not, so to improve performance, start with your first record and run through the rest until you either notice something not properly sorted or reach the end of the list. If you find a mismatch, then only sort the items from that position to the end (since the beginning of the list is already sorted).
For each item in the second part, see if the item is less than the last element in the first part, and if so use an insertion sort into ONLY the first part. Otherwise quicksort all the other items in the second part. This way the sort is optimized for the specific case.
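The core of this answer, detecting the sorted prefix and then only sorting the remainder, can be sketched in Python like so (I use `sorted()` and a linear merge in place of the insertion-sort/quicksort split for brevity; the names are illustrative):

```python
def prefix_aware_sort(data):
    """Find the already-sorted prefix in one pass, sort only the
    remainder, then merge the two sorted parts in linear time."""
    # one pass: find the first index where the order breaks
    split = len(data)
    for i in range(1, len(data)):
        if data[i] < data[i - 1]:
            split = i
            break
    if split == len(data):
        return list(data)                  # already sorted, nothing to do

    prefix, tail = data[:split], sorted(data[split:])

    # linear merge of the two sorted parts
    result, i, j = [], 0, 0
    while i < len(prefix) and j < len(tail):
        if prefix[i] <= tail[j]:
            result.append(prefix[i]); i += 1
        else:
            result.append(tail[j]); j += 1
    return result + prefix[i:] + tail[j:]
```

On fully sorted input this costs a single scan and no sorting at all, which is the case the question is worried about.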
QuickSort being a problem only when the data set is huge and already mostly sorted, I would use the following heuristics (pending a full-blown solution):
Don't bother if the data set size is below a threshold.
If you have quick (indexed) access to the records (items), take a sample of 1 record in every N and see if they are already sorted. It should be quick enough for a small sample, and you can then decide whether to use quicksort or not.
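That sampling heuristic is a one-liner given indexed access. A sketch in Python (`sample_is_sorted` and the `step` parameter are my own names; note a sorted sample only suggests, and does not prove, that the whole data set is mostly sorted):

```python
def sample_is_sorted(records, step=100):
    """Look at every step-th record and report whether that sample is
    in order: O(n/step) comparisons instead of a full O(n) scan."""
    sample = records[::step]
    return all(a <= b for a, b in zip(sample, sample[1:]))
```

False negatives are impossible for fully sorted input; false positives (a sorted sample hiding unsorted stretches) are the price of looking at only 1/N of the records.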
To make a conceptual point that people haven't yet made: Quicksort is a common-sense divide-and-conquer algorithm with an obvious bug in rare cases. Suppose that you want to sort a stack of student papers. (Which I have to do with some regularity.) In the quicksort algorithm, you pick some paper, the pivot. Then divide the other papers according to whether they are before or after the pivot. Then repeat that with the two subpiles. What's the bug? The pivot could be a name that is near one end of the list instead of in the middle, so that it doesn't accomplish much to divide it into two piles.
Merge sort is another divide-and-conquer algorithm that works in a different order. You can merge two sorted lists in linear time. Divide the papers into two equal or nearly equal piles, then recursively sort each one, then merge. Merge sort doesn't have any bugs. One reason that quicksort is more popular than merge sort is historical: Quicksort is fast (usually) and it works without any extra memory. But these days, it can be more important to save comparisons than to save memory, and the actual rearrangement is often abstracted by permuting pointers. If things had always been that way, then I suspect that merge sort would simply have been more popular than quicksort. (And maybe adding "quick" to the name was good salesmanship.)
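The linear-time merge this answer relies on, and the recursion built on top of it, fit in a few lines. A sketch in Python (out-of-place, which matches the "save comparisons rather than memory" point above):

```python
def merge(left, right):
    """Merge two already-sorted lists in linear time: the core step
    of merge sort described above."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def merge_sort(items):
    """Split into two nearly equal piles, sort each recursively, merge."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    return merge(merge_sort(items[:mid]), merge_sort(items[mid:]))
```

Because the split is always down the middle, there is no pivot to choose badly, which is exactly the "bug" quicksort has and merge sort doesn't.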