比较排序算法在每一步评估超过 2x 的元素
比较排序算法通常在每个步骤评估 2x 个元素(如果第一个元素小于/大于或等于第二个元素,则执行不同的操作。)
排序算法的一些示例,最好是稳定的,其中超过每一步都可以比较 2x 个元素?
如何评估它们的性能?
对于一次排序 2 个元素,比较排序的平均性能无法优于 O(n log n)
,一次排序 3、4、5+ 元素的性能限制是多少?
Comparison sorting algorithms normally evaluate 2x elements at each step (and do something different if the first element is smaller/bigger than or equal to the second element.)
What are some examples of sorting algorithms, preferably stable ones, where more than 2x elements can be compared at each step?
How would their performance be evaluated?
For sorting 2 elements at a time, a comparison sort cannot perform better than O(n log n)
on average, what's the performance limit for sorting 3, 4, 5+ elements at a time?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(5)
一次任何 n 元素(对于指定的固定 n)比较可以通过一系列固定数量的一次 2 元素比较来模拟。如果您一次可以做得比
O(n log n)
更好,它会自动为您提供一个比O(n log n)
更好的算法一次 2 个。Any n-element at a time (for a specified fixed n) comparisons can be emulated by a series of some fixed number of 2-element at a time comparisons. If you could do better than
O(n log n)
on n-at-a-time, it would automatically give you a better thanO(n log n)
algorithm for 2-at-a-time.k 路归并排序需要确定 k 个元素中哪一个最小。使用标准 2x 比较,需要 3 次比较才能确定最小元素。对于基于内存的排序,4 路合并排序所需的操作数与 2 路合并排序的操作数大致相同,但由于对缓存更加友好,因此运行时间略有改善。 有一个4路合并排序的例子:
优化的合并排序比快速排序更快
在这个答案的末尾 一个有 16 个寄存器、4 路的系统大约是你能达到的最高水平。另外,尝试在不使用堆的情况下实现这一点(这会导致算法变慢)需要一个大型的 if | 嵌套树。 else 语句。在上面链接的答案中,四路合并排序已经足够复杂了。
A k-way merge sort needs to determine which of k elements is the smallest. Using a standard 2x compare, it will take 3 compares to determine the smallest element. For a memory based sort, a 4-way merge sort will take the about the same number of operations as a 2-way merge sort, but gets a slight improvement in running time due to being more cache friendly. There is an example of a 4 way merge sort at the end of this answer:
Optimized merge sort faster than quicksort
On a system with 16 registers, 4-way is about as high as you can go. Plus trying to implement this without using a heap (which results in a slower algorithm) requires a large nested tree of if | else statements. It's already complex enough in the 4-way merge sort in that answer linked to above.
使用尽可能多的同时比较来搜索有效的排序方法是通过“排序网络”完成的,维基百科有一篇关于它的好文章 对网络进行排序。
这里是对从 n=3 到 n=7 的数组进行排序网络的伪代码。
可以使用多线程、矢量化或专用硬件同时完成同一行(步骤)的比较。
对 n 个元素进行排序的理论极限是 O(log(n)) 步数,其中 n 足够大,但是构建最优排序网络是一个 NP 问题,有一些算法可以创建 O(log^2( n)) 步骤。目前已知 n 最大为 17 的最佳排序网络。请注意,比较次数仍然是 O(n log(n))。
The search for efficient ways to sort using as many simultaneous comparisons as possible is done with "sorting networks", Wikipedia has a nice article about it Sorting Networks.
Here is the pseudocode for sorting networks for arrays from n=3 to n=7.
Comparisons on the same row (step) can be done simultaneously, using multithreading, vectorization or specialized hardware.
The theoretical limit for sorting n elements is O(log(n)) number of steps, with n sufficient big, but to construct an optimal sorting network is a NP problem, there are algoritms to create sorting networks with O(log^2(n)) steps. Currently there are known optimal sorting networks for n up to 17. Note that the number of comparisons is still O(n log(n)).
要一次对四个字节元素进行排序,可以使用查找表。
但 RAM 延迟会使其变慢,除非字节值倾向于命中 L1 缓存。
如果元素值非常有限,例如整数 0、1、2、3,则每个值仅需要 2 位。然后,您可以仅使用适合 L1 的 256 大小的查找表以每次操作 4 次进行排序。
如果值只是 0 和 1,那么您甚至不需要查找表。只需计算变量中的 1,CPU 指令即可在 O(1) 内完成此操作。
CPU 管道指令,因此只要您比较(和交换)独立元素,您就可以根据管道的深度获得并行性。
CPU 具有 simd 单元,因此您可以一次比较 2、4、8、16 个 32 位元素,并有选择地在每个元素中选择较大/较小的值。可能这些也是流水线的,因此您可以获得更高的并行性。
For sorting four byte-elements at once, you can use lookup tables.
But RAM latency will make it slow unless byte values tend to hit L1 cache.
If element values are very limited like integers 0,1,2,3 then it needs only 2 bits per value. Then you can do sorting at 4 per operation using only 256 sized lookup table which fits L1.
If values are just 0 and 1, then you dont even need lookup table. Just count the 1s in the variable, there are cpu instructions to do it in O(1).
Cpu pipelines instructions so as long as you compare(&swap) independent elements, you get parallelism depending on depth of pipeline.
Cpu has simd units so that you can compare 2,4,8,16 32bit elements at once and selectively choose greater/lesser in each one. Possibly these are also pipelined so you get way higher parallelism.
是一样的。当您谈论比较排序时,它实际上是一个决策树。因此,对多少个元素进行排序并不重要,限制来自于比较 2 个元素的要求。
It is the same. While you are talking about comparative sorting it is actually a decision tree. Therefore it doesn't really matter how many elements you are sorting the limit comes from the requirement of comparing 2 elements.