A worst-case O(n) algorithm for k-selection
Apart from the median-of-medians algorithm, is there any other way to do k-selection in worst-case O(n) time? Does implementing median-of-medians make sense? I mean, is the performance advantage good enough for practical purposes?
There is another algorithm for computing kth order statistics based on the soft heap data structure, which is a variant on a standard priority queue that is allowed to "corrupt" some number of the priorities it stores. The algorithm is described in more detail on the Wikipedia article, but the basic idea is to use the soft heap to efficiently (O(n) time) pick a pivot for the partition function that has a guarantee of a good split. In a sense, this is simply a modified version of the median-of-medians algorithm that uses an (arguably) more straightforward approach to choosing the pivot element.
Soft heaps are not particularly intuitive, but there is a pretty good description of them available in this paper ("A simpler implementation and analysis of Chazelle's soft heaps"), which includes a formal description and analysis of the data structure.
However, if you want a really fast, worst-case O(n) algorithm, consider looking into introselect. This algorithm is actually quite brilliant. It starts off by using the quickselect algorithm, which picks a pivot unintelligently and uses it to partition the data. This is extremely fast in practice, but has bad worst-case behavior. Introselect fixes this by keeping track of an internal counter that tracks its progress. If the algorithm ever looks like it's about to degrade to O(n²) time, it switches algorithms and uses something like median-of-medians to ensure the worst-case guarantee. Specifically, it watches how much of the array is discarded at each step, and if some constant number of steps occur before half the input is discarded, the algorithm switches to the median-of-medians algorithm to ensure that the next pivot is good before then restarting using quickselect. This guarantees worst-case O(n) time.
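As a rough illustration (not taken from the paper), here is a minimal C++ sketch of the introselect idea: random-pivot quickselect with a progress counter, falling back to a guaranteed method when progress stalls. For brevity the fallback is a full sort, which only gives O(n log n) in the fallback path; real introselect uses median-of-medians there to keep the O(n) bound. All names and the constant are our own.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Sketch of introselect: quickselect with random pivots, monitored for
// progress. Precondition: k < a.size() and a is nonempty.
int introselect_sketch(std::vector<int> a, std::size_t k) {
    std::mt19937 rng(12345);
    std::size_t lo = 0, hi = a.size();  // active range [lo, hi)
    std::size_t budget = a.size();      // size we expect to halve soon
    int bad_rounds = 0;                 // rounds without halving
    while (hi - lo > 1) {
        if (bad_rounds > 2) {           // progress too slow: fall back to a
            std::sort(a.begin() + lo, a.begin() + hi);  // guaranteed method
            return a[k];                // (stand-in for median-of-medians)
        }
        std::uniform_int_distribution<std::size_t> d(lo, hi - 1);
        int pivot = a[d(rng)];
        // Three-way partition: [< pivot][== pivot][> pivot].
        auto mid1 = std::partition(a.begin() + lo, a.begin() + hi,
                                   [&](int x) { return x < pivot; });
        auto mid2 = std::partition(mid1, a.begin() + hi,
                                   [&](int x) { return x == pivot; });
        std::size_t i1 = mid1 - a.begin(), i2 = mid2 - a.begin();
        if (k < i1)      hi = i1;       // answer is left of the pivot block
        else if (k < i2) return pivot;  // k falls inside the == block
        else             lo = i2;       // answer is right of the pivot block
        if (hi - lo <= budget / 2) { budget = hi - lo; bad_rounds = 0; }
        else                       ++bad_rounds;
    }
    return a[lo];
}
```

Regardless of which pivots the generator picks, each round either answers or shrinks the range, so the result is always the k-th smallest element (0-based).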
The advantage of this algorithm is that it's extremely fast on most inputs (since quickselect is very fast), but has great worst-case behavior. A description of this algorithm, along with the related sorting algorithm introsort, can be found in this paper ("Introspective Sorting and Selection Algorithms").
Hope this helps!
I think that you should really test it and find out what the performance is when you have N million elements in your container. This algorithm has already been implemented in the STL library (C++) as
std::nth_element
which is guaranteed to be expected O(n). So if you used C++, you could easily run some tests and see if the performance is good enough for what you seek.
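Running such a test with std::nth_element takes only a few lines; here is a minimal self-contained example (the wrapper name is our own, not part of the STL):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Tiny wrapper around std::nth_element, which rearranges the vector so
// that v[k] holds the element that would sit at index k in sorted
// order, in expected linear time. Precondition: k < v.size().
int kth_smallest(std::vector<int> v, std::size_t k) {
    std::nth_element(v.begin(), v.begin() + k, v.end());
    return v[k];
}
```

With v = {9, 1, 8, 2, 7, 3, 6, 4, 5}, kth_smallest(v, 4) returns 5, the median; scaling v up to millions of elements and timing the call is an easy benchmark.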
It depends. If you're concerned about the worst case happening accidentally, I wouldn't bother. As the data grows large enough to care, the worst case becomes so unlikely that it's hardly worth protecting against.
If you're doing the selection in a situation where a client could provide the data in the worst-case order to do a denial of service on your server, then it's probably worth using a median of medians to assure that the worst-case order doesn't hurt performance to any significant degree.
Updated:
There is a linear-time algorithm, a modification of quicksort, suggested by quicksort's inventor Hoare himself.
I suggest referring to the section 9.3 "Selection in worst-case linear time" in CLRS book.
Here is the brief algorithm, assuming we have a method
random_partition
from quicksort (which uses a random pivot for partition).
You can also refer to Donald Knuth's TAOCP Vol. 3, Sorting and Searching, p. 633.
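The randomized selection this describes can be sketched as follows (a hedged illustration with our own names and details, in the spirit of CLRS's RANDOMIZED-SELECT, not the answer's original code):

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

std::mt19937 rng(42);  // fixed seed so the sketch is reproducible

// Lomuto-style partition around a randomly chosen pivot. Returns the
// final index of the pivot; elements <= pivot end up to its left,
// larger ones to its right. Operates on the inclusive range [lo, hi].
std::size_t random_partition(std::vector<int>& a,
                             std::size_t lo, std::size_t hi) {
    std::uniform_int_distribution<std::size_t> d(lo, hi);
    std::swap(a[d(rng)], a[hi]);        // move a random pivot to the end
    int pivot = a[hi];
    std::size_t i = lo;
    for (std::size_t j = lo; j < hi; ++j)
        if (a[j] <= pivot) std::swap(a[i++], a[j]);
    std::swap(a[i], a[hi]);
    return i;
}

// Returns the k-th smallest element (0-based) in expected O(n) time.
// Precondition: a is nonempty and k < a.size().
int random_select(std::vector<int> a, std::size_t k) {
    std::size_t lo = 0, hi = a.size() - 1;
    while (lo < hi) {
        std::size_t p = random_partition(a, lo, hi);
        if (p == k) return a[p];
        if (p < k) lo = p + 1;          // discard the left part
        else       hi = p - 1;          // discard the right part
    }
    return a[lo];
}
```

Note that, exactly as observed above, the array is never fully sorted: each round discards one side of the partition and recurses only into the side containing index k.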
The beauty of this method is that, the array need not be completely sorted!
I think the STL's std::nth_element uses this technique; you can refer to the notes section.