Why is merge sort preferred over quicksort for sorting linked lists?
I read the following in a forum:

"Merge sort is very efficient for immutable data structures like linked lists."

and

"Quick sort is typically faster than merge sort when the data is stored in memory. However, when the data set is huge and is stored on external devices such as a hard drive, merge sort is the clear winner in terms of speed. It minimizes the expensive reads of the external drive."

and

"When operating on linked lists, merge sort only requires a small constant amount of auxiliary storage."

Can someone help me understand the above argument? Why is merge sort preferred for sorting huge linked lists? And how does it minimize expensive reads to an external drive? Basically, I want to understand why one would choose merge sort to sort a big linked list.
3 Answers
Quick sort works well for sorting in-place. In particular, most of the operations can be defined in terms of swapping pairs of elements in an array. To do that, however, you normally "walk" through the array with two pointers (or indexes, etc.). One starts at the beginning of the array and the other at the end. Both then work their way toward the middle (and you're done with a particular partition step when they meet). That's expensive with files, because files are oriented primarily toward reading in one direction, from beginning to end. Starting from the end and seeking backwards is usually relatively expensive.
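For concreteness, here is a minimal Python sketch of that two-pointer partition step (Hoare's scheme); the names are illustrative, not from the answer itself:

```python
def hoare_partition(a, lo, hi):
    """One partition step: two pointers walk toward each other, swapping pairs."""
    pivot = a[(lo + hi) // 2]
    i, j = lo - 1, hi + 1
    while True:
        i += 1
        while a[i] < pivot:       # left pointer walks forward
            i += 1
        j -= 1
        while a[j] > pivot:       # right pointer walks backward
            j -= 1
        if i >= j:                # pointers met: this partition step is done
            return j
        a[i], a[j] = a[j], a[i]   # swap the out-of-place pair

def quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if lo < hi:
        p = hoare_partition(a, lo, hi)
        quicksort(a, lo, p)       # with Hoare's scheme the returned index
        quicksort(a, p + 1, hi)   # splits the range into lo..p and p+1..hi
```

Note how j scans backwards from the end: cheap on an in-memory array, but exactly the access pattern that is expensive on a file.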
At least in its simplest incarnation, merge sort is pretty much the opposite. The easy way to implement it only requires looking through the data in one direction, but involves breaking the data into two separate pieces, sorting the pieces, then merging them back together.
With a linked list, it's easy to take (for example) alternating elements in one linked list, and manipulate the links to create two linked lists from those same elements instead. With an array, rearranging elements so alternating elements go into separate arrays is easy if you're willing to create a copy as big as the original data, but otherwise rather more non-trivial.
Likewise, merging with arrays is easy if you merge elements from the source arrays into a new array with the data in order -- but to do it in place without creating a whole new copy of the data is a whole different story. With a linked list, merging elements together from two source lists into a single target list is trivial -- again, you just manipulate links, without copying elements.
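To make the link manipulation concrete, here is a hedged Python sketch of a linked-list merge sort, assuming a minimal hypothetical Node class; both the split and the merge rearrange next pointers only and never copy elements:

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def split_alternating(head):
    """Relink alternating nodes into two lists (order within each half is
    reversed, which merge sort doesn't care about)."""
    a = b = None
    take_a = True
    while head is not None:
        nxt = head.next
        if take_a:
            head.next, a = a, head   # push node onto list a
        else:
            head.next, b = b, head   # push node onto list b
        take_a = not take_a
        head = nxt
    return a, b

def merge(a, b):
    """Merge two sorted lists into one by relinking nodes; no copying."""
    dummy = tail = Node(None)
    while a is not None and b is not None:
        if a.value <= b.value:
            tail.next, a = a, a.next
        else:
            tail.next, b = b, b.next
        tail = tail.next
    tail.next = a if a is not None else b
    return dummy.next

def merge_sort(head):
    if head is None or head.next is None:
        return head
    a, b = split_alternating(head)
    return merge(merge_sort(a), merge_sort(b))

# Build 3 -> 1 -> 2, sort it, and walk the result: prints 1, 2, 3.
node = merge_sort(Node(3, Node(1, Node(2))))
while node:
    print(node.value)
    node = node.next
```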
As for using Quicksort to produce the sorted runs for an external merge sort, it does work, but it's (decidedly) sub-optimal as a rule. To optimize a merge-sort, you normally want to maximize the lengths of each sorted "run" as you produce it. If you simply read in the data that will fit in memory, Quicksort it and write it out, each run will be restricted to (a little less than) the size of the available memory.
You can do quite a bit better than that as a rule though. You start by reading in a block of data, but instead of using a Quicksort on it, you build a heap. Then, as you write each item out from the heap into the sorted "run" file, you read another item in from your input file. If it's larger than the item you just wrote to disk, you insert it into your existing heap, and repeat.
Items that are smaller (i.e., belong before items that have already been written) you keep separate, and build into a second heap. When (and only when) your first heap is empty, and the second heap has taken over all the memory, you quit writing items to the existing "run" file, and start on a new one.
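Here is a hedged Python sketch of that two-heap scheme (often called replacement selection); the in-memory items iterable stands in for the input file, memory_limit for the available memory, and the yielded lists for the "run" files:

```python
import heapq

def replacement_selection(items, memory_limit):
    """Yield sorted runs; on random input each run averages ~2x memory_limit."""
    it = iter(items)
    current = []                          # heap feeding the current run
    pending = []                          # items that belong in the next run
    for x in it:                          # fill "memory" with the first block
        current.append(x)
        if len(current) == memory_limit:
            break
    heapq.heapify(current)
    run = []
    for x in it:                          # one read per item written out
        smallest = heapq.heappop(current)
        run.append(smallest)              # "write" smallest to the run file
        if x >= smallest:
            heapq.heappush(current, x)    # still usable in the current run
        else:
            heapq.heappush(pending, x)    # too small: save it for the next run
        if not current:                   # first heap empty, memory all pending:
            yield run                     # close this run and start a new one
            run, current, pending = [], pending, []
    while current:                        # input exhausted: drain the heaps
        run.append(heapq.heappop(current))
    if run:
        yield run
    if pending:
        yield sorted(pending)

# Example: a "memory" of 3 items still yields a first run of length 5.
print(list(replacement_selection([5, 1, 9, 2, 8, 3, 7], memory_limit=3)))
# [[1, 2, 5, 8, 9], [3, 7]]
```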
Exactly how effective this will be depends on the initial order of the data. In the worst case (input sorted in inverse order) it does no good at all. In the best case (input already sorted) it lets you "sort" the data in a single run through the input. In an average case (input in random order) it lets you approximately double the length of each sorted run, which will typically improve speed by around 20-25% (though the percentage varies depending on how much larger your data is than the available memory).
Quicksort depends on being able to index into an array or similar structure. When that's possible, it's hard to beat Quicksort.
But you can't index directly into a linked list very quickly. That is, if myList is a linked list, then myList[x], were it possible to write such syntax, would involve starting at the head of the list and following the first x links. That would have to be done twice for every comparison that Quicksort makes, and that would get expensive real quick.

Same thing on disk: Quicksort would have to seek and read every item it wants to compare.
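A tiny illustration of that cost, with a hypothetical singly linked Node class; the linked-list analogue of myList[x] is an O(x) walk, not an O(1) lookup:

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

def index_into(head, x):
    """What myList[x] would have to do: follow the first x links from the head."""
    node = head
    for _ in range(x):
        node = node.next
    return node
```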
Merge sort is faster in these situations because it reads the items sequentially, typically making log2(N) passes over the data. There is much less I/O involved, and much less time spent following links in a linked list.
Quicksort is fast when the data fits into memory and can be addressed directly. Mergesort is faster when data won't fit into memory or when it's expensive to get to an item.
Note that large file sorts typically load as much of the file as will fit into memory, Quicksort that, write it out to a temporary file, and repeat until the entire file has been processed. At that point there is some number of blocks, each of which is sorted, and the program then does an N-way merge to produce the sorted output.
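As a hedged sketch of that final phase (the runs here are small in-memory lists standing in for sorted temporary files), Python's heapq.merge performs the N-way merge while holding only one "current" item per run, which is what keeps the memory footprint small:

```python
import heapq

# Three sorted runs, stand-ins for sorted temporary files on disk.
runs = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]

# heapq.merge reads each run sequentially and keeps O(N) items in memory,
# where N is the number of runs, not the total amount of data.
print(list(heapq.merge(*runs)))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```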
A quicksort will move records into the middle of the list. In order to move an item to index X, it has to start at 0 and iterate one record at a time.

A mergesort splits the list into several small lists and only ever compares the items at the heads of the lists.

The setup for a merge sort is typically more expensive than the iteration required by a quicksort. However, when a list is sufficiently large, or the reads are expensive (like from a disk), the time it takes for the quicksort to iterate becomes a major factor.