Why is merge sort preferred over quicksort for sorting linked lists?
I read the following in a forum:

"Merge sort is very efficient for immutable data structures like linked lists."

and

"Quick sort is typically faster than merge sort when the data is stored in memory. However, when the data set is huge and is stored on external devices such as a hard drive, merge sort is the clear winner in terms of speed. It minimizes the expensive reads of the external drive."

and

"When operating on linked lists, merge sort only requires a small constant amount of auxiliary storage."

Can someone help me understand the above argument? Why is merge sort preferred for sorting huge linked lists? And how does it minimize expensive reads to an external drive? Basically, I want to understand why one would choose merge sort to sort a big linked list.
3 Answers
Quick sort works well for sorting in-place. In particular, most of the operations can be defined in terms of swapping pairs of elements in an array. To do that, however, you normally "walk" through the array with two pointers (or indexes, etc.). One starts at the beginning of the array and the other at the end. Both then work their way toward the middle (and you're done with a particular partition step when they meet). That's expensive with files, because files are oriented primarily toward reading in one direction, from beginning to end. Starting from the end and seeking backwards is usually relatively expensive.
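For concreteness, here is a minimal Python sketch of that two-pointer partition step (Hoare's scheme); the names are illustrative, not from the answer itself:

```python
def hoare_partition(a, lo, hi):
    """One partition step: two pointers walk toward each other, swapping pairs."""
    pivot = a[(lo + hi) // 2]
    i, j = lo - 1, hi + 1
    while True:
        i += 1
        while a[i] < pivot:       # left pointer walks forward
            i += 1
        j -= 1
        while a[j] > pivot:       # right pointer walks backward
            j -= 1
        if i >= j:                # pointers met: this partition step is done
            return j
        a[i], a[j] = a[j], a[i]   # swap the out-of-place pair

def quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if lo < hi:
        p = hoare_partition(a, lo, hi)
        quicksort(a, lo, p)       # with Hoare's scheme the returned index
        quicksort(a, p + 1, hi)   # splits the range into lo..p and p+1..hi
```

Note how j scans backwards from the end: cheap on an in-memory array, but exactly the access pattern that is expensive on a file.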
At least in its simplest incarnation, merge sort is pretty much the opposite. The easy way to implement it only requires looking through the data in one direction, but involves breaking the data into two separate pieces, sorting the pieces, then merging them back together.
With a linked list, it's easy to take (for example) alternating elements in one linked list, and manipulate the links to create two linked lists from those same elements instead. With an array, rearranging elements so alternating elements go into separate arrays is easy if you're willing to create a copy as big as the original data, but otherwise rather more non-trivial.
Likewise, merging with arrays is easy if you merge elements from the source arrays into a new array with the data in order -- but to do it in place without creating a whole new copy of the data is a whole different story. With a linked list, merging elements together from two source lists into a single target list is trivial -- again, you just manipulate links, without copying elements.
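To make the link manipulation concrete, here is a hedged Python sketch of a linked-list merge sort, assuming a minimal hypothetical Node class; both the split and the merge rearrange next pointers only and never copy elements:

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def split_alternating(head):
    """Relink alternating nodes into two lists (order within each half is
    reversed, which merge sort doesn't care about)."""
    a = b = None
    take_a = True
    while head is not None:
        nxt = head.next
        if take_a:
            head.next, a = a, head   # push node onto list a
        else:
            head.next, b = b, head   # push node onto list b
        take_a = not take_a
        head = nxt
    return a, b

def merge(a, b):
    """Merge two sorted lists into one by relinking nodes; no copying."""
    dummy = tail = Node(None)
    while a is not None and b is not None:
        if a.value <= b.value:
            tail.next, a = a, a.next
        else:
            tail.next, b = b, b.next
        tail = tail.next
    tail.next = a if a is not None else b
    return dummy.next

def merge_sort(head):
    if head is None or head.next is None:
        return head
    a, b = split_alternating(head)
    return merge(merge_sort(a), merge_sort(b))

# Build 3 -> 1 -> 2, sort it, and walk the result: prints 1, 2, 3.
node = merge_sort(Node(3, Node(1, Node(2))))
while node:
    print(node.value)
    node = node.next
```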
As for using Quicksort to produce the sorted runs for an external merge sort, it does work, but it's (decidedly) sub-optimal as a rule. To optimize a merge-sort, you normally want to maximize the lengths of each sorted "run" as you produce it. If you simply read in the data that will fit in memory, Quicksort it and write it out, each run will be restricted to (a little less than) the size of the available memory.
You can do quite a bit better than that as a rule though. You start by reading in a block of data, but instead of using a Quicksort on it, you build a heap. Then, as you write each item out from the heap into the sorted "run" file, you read another item in from your input file. If it's larger than the item you just wrote to disk, you insert it into your existing heap, and repeat.
Items that are smaller (i.e., belong before items that have already been written) you keep separate, and build into a second heap. When (and only when) your first heap is empty, and the second heap has taken over all the memory, you quit writing items to the existing "run" file, and start on a new one.
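Here is a hedged Python sketch of that two-heap scheme (often called replacement selection); the in-memory items iterable stands in for the input file, memory_limit for the available memory, and the yielded lists for the "run" files:

```python
import heapq

def replacement_selection(items, memory_limit):
    """Yield sorted runs; on random input each run averages ~2x memory_limit."""
    it = iter(items)
    current = []                          # heap feeding the current run
    pending = []                          # items that belong in the next run
    for x in it:                          # fill "memory" with the first block
        current.append(x)
        if len(current) == memory_limit:
            break
    heapq.heapify(current)
    run = []
    for x in it:                          # one read per item written out
        smallest = heapq.heappop(current)
        run.append(smallest)              # "write" smallest to the run file
        if x >= smallest:
            heapq.heappush(current, x)    # still usable in the current run
        else:
            heapq.heappush(pending, x)    # too small: save it for the next run
        if not current:                   # first heap empty, memory all pending:
            yield run                     # close this run and start a new one
            run, current, pending = [], pending, []
    while current:                        # input exhausted: drain the heaps
        run.append(heapq.heappop(current))
    if run:
        yield run
    if pending:
        yield sorted(pending)

# Example: a "memory" of 3 items still yields a first run of length 5.
print(list(replacement_selection([5, 1, 9, 2, 8, 3, 7], memory_limit=3)))
# [[1, 2, 5, 8, 9], [3, 7]]
```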
Exactly how effective this will be depends on the initial order of the data. In the worst case (input sorted in inverse order) it does no good at all. In the best case (input already sorted) it lets you "sort" the data in a single run through the input. In an average case (input in random order) it lets you approximately double the length of each sorted run, which will typically improve speed by around 20-25% (though the percentage varies depending on how much larger your data is than the available memory).
Quicksort depends on being able to index into an array or similar structure. When that's possible, it's hard to beat Quicksort.
But you can't index directly into a linked list very quickly. That is, if myList is a linked list, then myList[x], were it possible to write such syntax, would involve starting at the head of the list and following the first x links. That would have to be done twice for every comparison that Quicksort makes, and that would get expensive real quick.

Same thing on disk: Quicksort would have to seek and read every item it wants to compare.
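A tiny illustration of that cost, with a hypothetical singly linked Node class; the linked-list analogue of myList[x] is an O(x) walk, not an O(1) lookup:

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def index_into(head, x):
    """What myList[x] would have to do: follow the first x links from the head."""
    node = head
    for _ in range(x):
        node = node.next
    return node
```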
Merge sort is faster in these situations because it reads the items sequentially, typically making log2(N) passes over the data. There is much less I/O involved, and much less time spent following links in a linked list.
Quicksort is fast when the data fits into memory and can be addressed directly. Mergesort is faster when data won't fit into memory or when it's expensive to get to an item.
Note that large file sorts typically load as much of the file as will fit into memory, Quicksort that, write it out to a temporary file, and repeat until the entire file has been processed. At that point there is some number of blocks, each of which is sorted, and the program then does an N-way merge to produce the sorted output.
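As a hedged sketch of that final phase (the runs here are small in-memory lists standing in for sorted temporary files), Python's heapq.merge performs the N-way merge while holding only one "current" item per run, which is what keeps the memory footprint small:

```python
import heapq

# Three sorted runs, stand-ins for sorted temporary files on disk.
runs = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]

# heapq.merge reads each run sequentially and keeps O(N) items in memory,
# where N is the number of runs, not the total amount of data.
print(list(heapq.merge(*runs)))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```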
A quicksort will move records into the middle of the list. In order to move an item to index X, it has to start at 0 and iterate one record at a time.

A mergesort splits the list into several small lists and only ever compares the items at the heads of the lists.

The setup for a merge sort is typically more expensive than the iteration required by a quicksort. However, when a list is sufficiently large, or the reads are expensive (like from a disk), the time it takes for the quicksort to iterate becomes a major factor.