改进快速排序
如果可能,我如何改进以下快速排序(性能方面)。有什么建议吗?
void main()
{
quick(a,0,n-1);
}
void quick(int a[],int lower,int upper)
{
int loc;
if(lower<upper)
{
loc=partition(a,lower,upper);
quick(a,lower,loc-1);
quick(a,loc+1,upper);
}
}
/* Return type: int
Parameters passed: Unsorted array and its lower and upper bounds */
int partition(int a[],int lower,int upper)
{
int pivot,i,j,temp;
pivot=a[lower];
i=lower+1;
j=upper;
while(i<j)
{
while((i<upper)&&(a[i]<=pivot))
i++;
while((a[j]>pivot))
j--;
if(i<j)
{
temp=a[i];
a[i]=a[j];
a[j]=temp;
}
}//end while
if(pivot>a[j])
{
temp=a[j];
a[j]=a[lower];
a[lower]=temp;
}
return(j);
}//end partition
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(14)
快速排序算法很容易并行化。如果您有多个核心可供使用,您可能会看到相当多的速度提升。根据您的数据集有多大,它可以轻松地为您提供比任何其他优化更高的速度。但是,如果您只有一个处理器或相对较小的数据集,则不会有太大的加速。
如果有可能的话,我可以详细说明。
The quicksort algorithm is easily parallelized. If you have multiple cores to work with, you could see quite a bit of speed up. Depending on how large your data set is, it could easily provide you with more speed up than any other optimization. However, if you have only one processor or a relatively small data set, there won't be much of a speed up.
I could elaborate more if this is a possibility.
第一个建议是:用迭代替换其中一个递归调用。我的意思是真正的迭代,而不是手动实现的递归堆栈。即,不要从
quick
对quick
进行两次“新”调用,而是“回收”对quick
的当前调用处理递归的一个分支,并递归调用quick
来处理另一个分支。现在,如果您确保始终对较短的分区进行真正的递归调用(并对较长的分区使用迭代),它将保证递归深度永远不会超过 log N 即使在最坏的情况下,即无论您选择中位数的效果如何。
这一切都是在 GCC 标准库附带的
qsort
算法中实现的。看一下源码,应该有用。粗略地说,遵循上述建议的更实际的实现可能如下所示。
当然,这只是这个想法的一个草图。未测试。再次看一下 GCC 的实现,了解同样的想法。他们还将剩余的递归调用替换为“手动”递归,但这并不是真正必要的。
The first suggestion would be: replace one of the recursive calls with iteration. And I mean real iteration, not a manually implemented stack for recursion. I.e. instead of making two "new" calls to
quick
fromquick
, "recycle" your current call toquick
to process one branch of recursion, and callquick
recursively to process another branch.Now, if you make sure that you always make real recursive call for the shorter partition (and use iteration for the longer one), it will guarantee that the depth of recursion will never exceed
log N
even in the worst case, i.e. regardless of how well you choose your median.This all is implemented in
qsort
algorithm that comes with GCC standard library. Take a look at the source, it should be useful.Roughly, a more practical implementation that follows the above suggestion might look as follows
This is just a sketch of the idea, of course. Not tested. Again, take a look at GCC implementation for the same idea. They also replace the remaining recursive call with "manual" recursion, but it is not really necessary.
使用无循环算法对小块(<5个元素)进行排序可以提高性能。我只找到了一个如何为 5 个元素编写无循环排序算法的示例: http://wiki.tcl.tk/ 4118
这个想法可用于为 C 中的 2,3,4,5 元素生成无循环排序算法。
编辑:我在一组数据上尝试过,与中的代码相比,我测量了 87% 的运行时间的问题。在 <20 个块上使用插入排序在同一数据集上的结果为 92%。此测量不具有代表性,但可能表明这是改进快速排序代码的一种方法。
编辑:此示例代码写出 2-6 个元素的无循环排序函数:
生成的代码长 3718 行:
Sorting small blocks (<5 elements) with a loopless algorithm may improve performance. I found only one example how to write a loopless sorting algorithm for 5 elements: http://wiki.tcl.tk/4118
The idea can be used to generate loopless sorting algorithms for 2,3,4,5 elements in C.
EDIT: i tried it on one set of data, and i measured 87% running time compared to the code in the question. Using insertion sort on <20 blocks resulted 92% on the same data set. This measurement is not representative but may show that this is a way how can You improve Your quicksort code.
EDIT: this example code write out loopless sorting functions for 2-6 elements:
The generated code 3718 lines long:
尝试另一种排序算法。
根据您的数据,您也许可以用内存来换取速度。
根据维基百科
< em>编辑
显然你的数据是整数。对于 [0, 0x0fffffff] 范围内的 250 万个整数,我的基数排序实现速度大约是您的快速排序实现速度的 4 倍。
Try another sort algorithm.
Depending on your data, you may be able to trade memory for speed.
According to Wikipedia
Edit
Apparently your data is integers. With 2.5M integers in the range [0, 0x0fffffff], my implementation of radix-sort is about 4 times as fast as your implementation of quick-sort.
关于快速排序的维基百科文章有很多想法。
The wikipedia article on quicksort has a bunch of ideas.
您可以通过使用带有显式堆栈的快速排序来消除递归开销
您可以使用更好的主元选择技术,例如:
You can a eliminate recuriosn overhead by using QuickSort with Explicit Stack
You can use better pivot selection technique such as:
目前广泛使用的最先进的快速排序是在Java的DualPivotQuicksort.java
因此,您可以简单地遵循该方法,您将看到很好的性能改进:
或者如果您想增加一点复杂性,则编写 3-pivot 快速排序。
Currently the most advanced quicksort widely used is implemented in Java's DualPivotQuicksort.java
So you can simply follow that approach and you will see a nice performance improvement:
Or if you want to add a bit more complexity then code a 3-pivot quicksort.
如果这不仅仅是为了学习,请使用 stdlib.h 中的 qsort
If this is not just for learning use qsort from stdlib.h
根据您的代码,当排序长度为 10 时,最深的堆栈看起来像
除了算法本身之外,如果我们还考虑堆栈活动等系统行为,除了正常调用堆栈成本(推送/弹出)之外的其他内容可能会降低性能很多,例如上下文切换,如果多任务系统,您知道大多数操作系统都是多任务的,并且堆栈越深,切换成本越高,加上缓存未命中或跨缓存线边界。
我相信在这种情况下,如果长度变大,与其他一些排序算法相比,你就会失败。
例如,当长度达到 40 时,堆栈可能看起来像(可能比下面显示的条目更多):
堆栈太深。
函数内联也很有帮助,但它增加了最终可执行文件的代码占用量。
我同意@Swiss,并行编程可能对你有很大帮助。
Per your code, when sorted lenght is 10, the deepest stack looks like
Besides the algo itself, if we consider the system behavior such as stack activities as well, something else except normal call stack costs (push/pop) might downgrade the performance a lots, e.g. context switching if multi-task system, you know most OS are multi-task, and deeper the stack higher costs the switching, plus cache miss or cachline boundary across.
I believe in this case if the length become bigger, you'll lose when comparing to some other sorting algos.
for example, when length is up to 40, the stack may look like (may more entries than shown as below):
too deeper the stack.
Function inlining is helpful too, but it increases code footprint of the final executable.
I agree with @Swiss, parallel programming might help you so much.
完全愚蠢的答案,但是......在发布模式下编译代码并打开优化!
completely dumb answer, but... compile your code in release mode and turn on optimizations !
首先要做的就是对其进行基准测试。并将其与标准 qsort、c++ std::sort 和 std::stable_sort 进行基准测试。
如果您的数据集足够大,您的结果将表明 std::sort 在所有情况下都优于 qsort,除了 std::stable_sort 优于的情况。这是因为 std::sort 是模板化的,因此可以内联比较。
您的代码内联比较,因为它不是通用的。如果你的代码比 qsort 慢,那么你就有问题了。
更快的排序是并行对各个部分进行排序,例如 openmp,然后将它们合并在一起。
The first thing to do is benchmark it. And benchmark it against the standard qsort, and against the c++ std::sort and std::stable_sort.
Your results, should your data-set be big enough, will show that the std::sort is superior to qsort in all cases except those where the std::stable_sort is superior. This is because the std::sort is templated, and therefore the comparision can be inlined.
Your code inlines the comparison because its not generic. If your code is slower than qsort, you have a problem right there.
The faster sort would be to sort parts in parallel, e.g. openmp, and then merge them back together.
从我的回答中复制来回答问题。
编辑:本文假设您已经做了一些显而易见的事情,例如利用尾递归来消除不必要的调用开销。
人们喜欢批评快速排序在某些输入上表现不佳,尤其是当用户可以控制输入时。以下方法产生与中点选择匹配的性能,但随着列表大小的增长,预期复杂度呈指数级接近 O(n log n)。根据我的经验,由于大多数情况下的开销要少得多,因此它的性能明显优于三选一的选择。它应该在“预期”输入的中点选择下表现均匀,但不易受到不良输入的影响。
I
初始化快速排序。该值将在整个排序过程中使用(不必生成多个数字)。I mod SectionSize
。为了获得额外的性能,您应该始终将“小”列表段的快速排序切换为希尔排序 - 我已经看到选择 15-100 的长度作为截止值。
Copied from my answer to answer question.
Edit: This post assumes you already do obvious things like take advantage of tail recursion to get rid of the unnecessary call overhead.
People like to criticize the quicksort for poor performance with certain inputs, especially when the user has control of the input. The following approach yields performance matching midpoint selection but expected complexity exponentially approaches O(n log n) as the list grows in size. In my experience it significantly outperforms best-of-3 selection due to much less overhead in the majority case. It should perform evenly with midpoint selection for "expected" inputs, but is not vulnerable to poor inputs.
I
. That value will be used throughout the sorting process (don't have to generate multiple numbers).I mod SectionSize
.For additional performance, you should always switch your quicksort to shell sort for "small" list segments - I've seen lengths from 15-100 chosen as the cutoff.
多线程?
multithreading ?