快速排序:选择枢轴

Posted on 2024-07-07 07:50:46

When implementing Quicksort, one of the things you have to do is to choose a pivot. But when I look at pseudocode like the one below, it is not clear how I should choose the pivot. First element of list? Something else?

 function quicksort(array)
     var list less, greater
     if length(array) ≤ 1  
         return array  
     select and remove a pivot value pivot from array
     for each x in array
         if x ≤ pivot then append x to less
         else append x to greater
     return concatenate(quicksort(less), pivot, quicksort(greater))

Can someone help me grasp the concept of choosing a pivot, and whether or not different scenarios call for different strategies?

Comments (15)

八巷 2024-07-14 07:50:46

Choosing a random pivot minimizes the chance that you will encounter worst-case O(n²) performance (always choosing first or last would cause worst-case performance for nearly-sorted or nearly-reverse-sorted data). Choosing the middle element would also be acceptable in the majority of cases.

Also, if you are implementing this yourself, there are versions of the algorithm that work in-place (i.e. without creating two new lists and then concatenating them).
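
As a minimal sketch (not from the answer itself) of picking a random pivot index in C, assuming inclusive bounds lo and hi and that the generator has been seeded elsewhere:

#include <stdlib.h>   /* rand() */

/* Return a random pivot index in the inclusive range [lo, hi].
   Assumes hi >= lo; call srand() once elsewhere to seed the generator. */
int random_pivot_index(int lo, int hi) {
  return lo + rand() % (hi - lo + 1);
}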

红焚 2024-07-14 07:50:46

It depends on your requirements. Choosing a pivot at random makes it harder to create a data set that generates O(N^2) performance. 'Median-of-three' (first, last, middle) is also a way of avoiding problems. Beware of relative performance of comparisons, though; if your comparisons are costly, then Mo3 does more comparisons than choosing (a single pivot value) at random. Database records can be costly to compare.


Update: Pulling comments into answer.

mdkess asserted:

'Median of 3' is NOT first last middle. Choose three random indexes, and take the middle value of this. The whole point is to make sure that your choice of pivots is not deterministic - if it is, worst case data can be quite easily generated.

To which I responded:

  • Analysis Of Hoare's Find Algorithm With Median-Of-Three Partition (1997)
    by P Kirschenhofer, H Prodinger, C Martínez supports your contention (that 'median-of-three' is three random items).

  • There's an article described at portal.acm.org that is about 'The Worst Case Permutation for Median-of-Three Quicksort' by Hannu Erkiö, published in The Computer Journal, Vol 27, No 3, 1984. [Update 2012-02-26: Got the text for the article. Section 2 'The Algorithm' begins: 'By using the median of the first, middle and last elements of A[L:R], efficient partitions into parts of fairly equal sizes can be achieved in most practical situations.' Thus, it is discussing the first-middle-last Mo3 approach.]

  • Another short article that is interesting is by M. D. McIlroy, "A Killer Adversary for Quicksort", published in Software-Practice and Experience, Vol. 29(0), 1–4 (0 1999). It explains how to make almost any Quicksort behave quadratically.

  • AT&T Bell Labs Tech Journal, Oct 1984 "Theory and Practice in the Construction of a Working Sort Routine" states "Hoare suggested partitioning around the median of several randomly selected lines. Sedgewick [...] recommended choosing the median of the first [...] last [...] and middle". This indicates that both techniques for 'median-of-three' are known in the literature. (Update 2014-11-23: The article appears to be available at IEEE Xplore or from Wiley — if you have membership or are prepared to pay a fee.)

  • 'Engineering a Sort Function' by J L Bentley and M D McIlroy, published in Software Practice and Experience, Vol 23(11), November 1993, goes into an extensive discussion of the issues, and they chose an adaptive partitioning algorithm based in part on the size of the data set. There is a lot of discussion of trade-offs for various approaches.

  • A Google search for 'median-of-three' works pretty well for further tracking.

Thanks for the information; I had only encountered the deterministic 'median-of-three' before.
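
As an illustration of the randomized 'median-of-three' that mdkess describes (three random indices, take the index whose value is in the middle); the helper names here are illustrative, not from the answer:

#include <stdlib.h>   /* rand() */

/* Return a random index in the inclusive range [lo, hi]. */
static int random_index(int lo, int hi) {
  return lo + rand() % (hi - lo + 1);
}

/* Median-of-three over three randomly chosen positions of the array.
   Returns the index whose value is the median of the three sampled values. */
int median_of_three_random(const long a[], int lo, int hi) {
  int i = random_index(lo, hi);
  int j = random_index(lo, hi);
  int k = random_index(lo, hi);

  if ((a[i] <= a[j] && a[j] <= a[k]) || (a[k] <= a[j] && a[j] <= a[i]))
    return j;
  if ((a[j] <= a[i] && a[i] <= a[k]) || (a[k] <= a[i] && a[i] <= a[j]))
    return i;
  return k;
}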

只为一人 2024-07-14 07:50:46

Heh, I just taught this class.

There are several options.
Simple: Pick the first or last element of the range. (bad on partially sorted input)
Better: Pick the item in the middle of the range. (better on partially sorted input)

However, picking any arbitrary element runs the risk of poorly partitioning the array of size n into two arrays of size 1 and n-1. If you do that often enough, your quicksort runs the risk of becoming O(n^2).

One improvement I've seen is to pick median(first, last, mid);
In the worst case, it can still go to O(n^2), but probabilistically, this is a rare case.

For most data, picking the first or last is sufficient. But if you find that you're running into worst-case scenarios often (partially sorted input), the first option would be to pick the central value (which is a statistically good pivot for partially sorted data).

If you're still running into problems, then go the median route.
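
A minimal sketch of the deterministic median(first, last, mid) selection described above; it reorders the three sampled elements so the middle slot holds the median, which is one common way to do it (the function name is illustrative):

/* Order a[lo], a[mid], a[hi] so that a[mid] holds the median of the three,
   then return mid as the pivot index. Assumes hi > lo. */
int median3_pivot(long a[], int lo, int hi) {
  int mid = lo + (hi - lo) / 2;
  long t;

  if (a[mid] < a[lo]) { t = a[lo];  a[lo]  = a[mid]; a[mid] = t; }
  if (a[hi]  < a[lo]) { t = a[lo];  a[lo]  = a[hi];  a[hi]  = t; }
  if (a[hi]  < a[mid]) { t = a[mid]; a[mid] = a[hi];  a[hi]  = t; }
  return mid;   // a[mid] is now the median of the original three values
}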

败给现实 2024-07-14 07:50:46

Never ever choose a fixed pivot - this can be attacked to exploit your algorithm's worst-case O(n²) runtime, which is just asking for trouble. Quicksort's worst-case runtime occurs when partitioning results in one array of 1 element and one array of n-1 elements. Suppose you choose the first element as your pivot. If someone feeds your algorithm an array that is in decreasing order, your first pivot will be the biggest, so everything else in the array will move to the left of it. Then when you recurse, the first element will be the biggest again, so once more you put everything to the left of it, and so on.

A better technique is the median-of-3 method, where you pick three elements at random and choose the middle one. You know that the element you choose won't be the first or the last, but also, by the central limit theorem, the distribution of the middle element will be normal, which means that you will tend towards the middle (and hence, n log(n) time).

If you absolutely want to guarantee O(n log(n)) runtime for the algorithm, the columns-of-5 method for finding the median of an array runs in O(n) time, which means that the recurrence equation for quicksort in the worst case will be:

T(n) = O(n) (find the median) + O(n) (partition) + 2T(n/2) (recurse left and right)

By the Master Theorem, this is O(n log(n)). However, the constant factor will be huge, and if worst-case performance is your primary concern, use a merge sort instead, which is only a little bit slower than quicksort on average and guarantees O(n log(n)) time (and will be much faster than this lame median quicksort).

Explanation of the Median of Medians Algorithm

酒中人 2024-07-14 07:50:46

Don't try to get too clever and combine pivoting strategies. If you combine median-of-3 with a random pivot by picking the median of the first, the last, and a random index in the middle, then you'll still be vulnerable to many of the distributions which send median-of-3 quadratic (so it's actually worse than a plain random pivot).

E.g., for a pipe-organ distribution (1,2,3...N/2...3,2,1), the first and last elements will both be 1 and the random index will be some number greater than 1; taking the median gives 1 (either first or last), and you get an extremely unbalanced partitioning.
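
For concreteness, a small sketch that generates the 'pipe organ' input described above (the exact boundary handling is just illustrative):

/* Fill a[0..n-1] with a pipe-organ pattern: 1, 2, 3, ..., ..., 3, 2, 1.
   Both a[0] and a[n-1] are 1, which is exactly what defeats a
   median(first, last, random-middle) pivot as described above. */
void fill_pipe_organ(long a[], int n) {
  int i;
  for (i = 0; i < n; i++) {
    int up = i + 1;        // distance from the left end
    int down = n - i;      // distance from the right end
    a[i] = (up < down) ? up : down;
  }
}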

倚栏听风 2024-07-14 07:50:46

It is easier to break the quicksort into three sections:

  1. An exchange or swap function for data elements
  2. The partition function
  3. Processing the partitions

It is only slightly less efficient than one long function, but it is a lot easier to understand.

Code follows:

#include <stdlib.h>   /* needed for rand() in the partition function */

/* This selects what the data type in the array to be sorted is */

#define DATATYPE long

/* This is the swap function .. your job is to swap the data in x & y .. how depends on
the data type .. the example works for normal numerical data types .. like the long chosen
above */

void swap(DATATYPE *x, DATATYPE *y) {
  DATATYPE Temp;

  Temp = *x;        // Hold current x value
  *x = *y;          // Transfer y to x
  *y = Temp;        // Set y to the held old x value
}


/* This is the partition code */

int partition(DATATYPE list[], int l, int h) {

  int i;
  int p;          // pivot element index
  int firsthigh;  // divider position for pivot element

  // Random pivot shown here; for the middle element, p = (l + h) / 2 would be used
  p = l + rand() % (h - l + 1);               // Random partition point

  swap(&list[p], &list[h]);                   // Move the pivot to the end
  firsthigh = l;                              // First position of the "high" section
  for (i = l; i < h; i++)
    if (list[i] < list[h]) {                  // Value at i is less than the pivot
      swap(&list[i], &list[firsthigh]);       // So swap it into the "low" section
      firsthigh++;                            // Increment first high
    }
  swap(&list[h], &list[firsthigh]);           // Put the pivot into its final position
  return firsthigh;                           // Return the pivot's index
}


/* Finally the body sort */

void quicksort(DATATYPE list[], int l, int h) {

  int p;                                      // index of partition
  if ((h - l) > 0) {
    p = partition(list, l, h);                // Partition list
    quicksort(list, l, p - 1);                // Sort lower partition
    quicksort(list, p + 1, h);                // Sort upper partition
  }
}
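
A small usage sketch for the routines above (the sample data are just illustrative); it assumes it is compiled together with the functions shown:

#include <stdio.h>

int main(void) {
  DATATYPE data[] = { 9, 3, 7, 1, 8, 2, 5 };
  int n = (int)(sizeof data / sizeof data[0]);
  int i;

  quicksort(data, 0, n - 1);          // Sort the whole array in place

  for (i = 0; i < n; i++)
    printf("%ld ", data[i]);          // %ld matches DATATYPE long
  printf("\n");
  return 0;
}
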
夏末染殇 2024-07-14 07:50:46

It depends entirely on how your data is ordered to begin with. If you think it will be pseudo-random, then your best bet is either to pick a pivot at random or to choose the middle element.

ゞ记忆︶ㄣ 2024-07-14 07:50:46

If you are sorting a random-accessible collection (like an array), it's generally best to pick the physical middle item. With this, if the array is already sorted (or nearly sorted), the two partitions will be close to even, and you'll get the best speed.

If you are sorting something with only linear access (like a linked list), then it's best to choose the first item, because it's the fastest item to access. Here, however, if the list is already sorted, you're screwed -- one partition will always be empty while the other has everything, producing the worst-case time.

However, for a linked list, picking anything besides the first will just make matters worse. To pick the middle item in a linked list, you'd have to step through it on each partition step -- adding an O(N/2) operation which is done log N times, making the total time O(1.5 N log N). And that's if we know how long the list is before we start -- usually we don't, so we'd have to step all the way through to count the elements, then step halfway through to find the middle, then step through a third time to do the actual partition: O(2.5 N log N).

恏ㄋ傷疤忘ㄋ疼 2024-07-14 07:50:46

It depends very much on your specific needs and on the input source/characteristics.

For any deterministic method of pivot selection, there exists a
worst-case input instance which will doom us to quadratic time. We can
add an initial step to our algorithm where we randomly permute the
order of the n elements before we try to sort them. Such a permutation
can be constructed in O(n) time. This might seem like wasted effort,
but it provides the guarantee that we can expect Θ(n log n) running
time whatever the initial input was. The worst case performance still
can happen, but it now depends only upon how unlucky we are. There is
no longer a well-defined “worst-case” input.

(c) The Algorithm Design Manual 3rd edition

Big thanks to Steven Skiena for this wonderful book <3
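
A minimal sketch of the O(n) random-permutation step the quote refers to, done with a Fisher–Yates shuffle (the function name is illustrative):

#include <stdlib.h>   /* rand() */

/* Randomly permute a[0..n-1] in O(n) time (Fisher-Yates shuffle).
   Shuffling first means no particular input order is pathological
   for a deterministic pivot rule. */
void shuffle(long a[], int n) {
  int i;
  for (i = n - 1; i > 0; i--) {
    int j = rand() % (i + 1);       // index in [0, i]; modulo bias ignored here
    long tmp = a[i];
    a[i] = a[j];
    a[j] = tmp;
  }
}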

送君千里 2024-07-14 07:50:46

Ideally the pivot should be the median value of the entire array, so that it splits the array evenly.
This will reduce the chances of getting worst-case performance.

天涯离梦残月幽梦 2024-07-14 07:50:46

In a truly optimized implementation, the method for choosing pivot should depend on the array size - for a large array, it pays off to spend more time choosing a good pivot. Without doing a full analysis, I would guess "middle of O(log(n)) elements" is a good start, and this has the added bonus of not requiring any extra memory: Using tail-call on the larger partition and in-place partitioning, we use the same O(log(n)) extra memory at almost every stage of the algorithm.

隱形的亼 2024-07-14 07:50:46

Quicksort's complexity varies greatly with the selection of the pivot value. For example, if you always choose the first element as the pivot, the algorithm's complexity becomes as bad as O(n^2). Here is a smart method to choose the pivot element:
1. Pick the first, middle, and last elements of the array.
2. Compare these three numbers and find the one that is greater than one and smaller than the other, i.e. the median.
3. Make this element the pivot.

Choosing the pivot by this method splits the array into two nearly equal halves, and hence the complexity
reduces to O(n log(n)).

难忘№最初的完美 2024-07-14 07:50:46

On the average, Median of 3 is good for small n. Median of 5 is a bit better for larger n. The ninther, which is the "median of three medians of three" is even better for very large n.

The higher you go with sampling the better you get as n increases, but the improvement dramatically slows down as you increase the samples. And you incur the overhead of sampling and sorting samples.
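
A rough sketch of the 'ninther' (median of three medians-of-three) as a pivot value, sampling nine roughly evenly spaced positions; the spacing and names are one common choice, not prescribed by the answer:

/* Median of three values. */
static long med3(long x, long y, long z) {
  if ((x <= y && y <= z) || (z <= y && y <= x)) return y;
  if ((y <= x && x <= z) || (z <= x && x <= y)) return x;
  return z;
}

/* Ninther: median of the medians of three groups of three samples,
   taken from a[lo..hi]. Assumes at least 9 elements in the range. */
long ninther(const long a[], int lo, int hi) {
  int n = hi - lo + 1;
  int step = n / 8;
  int mid = lo + n / 2;
  long m1, m2, m3;

  m1 = med3(a[lo],            a[lo + step],  a[lo + 2 * step]);
  m2 = med3(a[mid - step],    a[mid],        a[mid + step]);
  m3 = med3(a[hi - 2 * step], a[hi - step],  a[hi]);
  return med3(m1, m2, m3);
}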

筱武穆 2024-07-14 07:50:46

I recommend using the middle index, as it can be calculated easily.

You can calculate it by rounding (array.length / 2).

番薯 2024-07-14 07:50:46

If you choose the first or the last element in the array, then there is a high chance that the pivot is the smallest or the largest element of the array, and that is bad.
Why?
Because in that case the number of elements smaller/larger than the pivot is 0, and this repeats at every level of recursion, as follows.
Consider an array of size n. Then the total work is

(n) + (n - 1) + (n - 2) + ...... + 1 = O(n^2)

Hence, the time complexity increases from O(n log n) to O(n^2). So, I highly recommend using the median or a random element of the array as the pivot.
