Why is my Python merge sort so slow?
I'm having some trouble understanding this behaviour. I'm measuring execution time with the timeit module and get the following results for 10000 cycles:
- Merge : 1.22722930395
- Bubble: 0.810706578175
- Select: 0.469924766812
This is my merge sort code:
def mergeSort(array):
    if len(array) <= 1:
        return array
    else:
        left = array[:len(array)/2]
        right = array[len(array)/2:]
        return merge(mergeSort(left), mergeSort(right))

def merge(array1, array2):
    merged_array = []
    while len(array1) > 0 or len(array2) > 0:
        if array2 and not array1:
            merged_array.append(array2.pop(0))
        elif (array1 and not array2) or array1[0] < array2[0]:
            merged_array.append(array1.pop(0))
        else:
            merged_array.append(array2.pop(0))
    return merged_array
Edit:
I've changed the list operations to use pointers, and my tests now use a list of 1000 random numbers from 0-1000. (By the way: I switched to only 10 cycles here.)
Results:
- Merge : 0.0574434420723
- Bubble: 1.74780097558
- Select: 0.362952293025
This is my rewritten merge definition:
def merge(array1, array2):
    merged_array = []
    pointer1, pointer2 = 0, 0
    while pointer1 < len(array1) and pointer2 < len(array2):
        if array1[pointer1] < array2[pointer2]:
            merged_array.append(array1[pointer1])
            pointer1 += 1
        else:
            merged_array.append(array2[pointer2])
            pointer2 += 1
    while pointer1 < len(array1):
        merged_array.append(array1[pointer1])
        pointer1 += 1
    while pointer2 < len(array2):
        merged_array.append(array2[pointer2])
        pointer2 += 1
    return merged_array
Now it seems to work just fine :)
Comments (4)
list.pop(0) pops the first element and has to shift all the remaining ones; that's an additional O(n) operation which must not happen. Also, slicing a list object creates a copy, which means you're also using O(n * log(n)) memory instead of O(n).
I can't see your bubble sort, but I bet it works in-place; no wonder it's faster.
You need to rewrite your merge sort to work in-place. Instead of copying parts of the original list, pass starting and ending indexes.
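A minimal sketch of that in-place idea (my reconstruction, not the commenter's code; the function name and signature are made up): sort subranges identified by start/end indexes instead of slicing out copies. The merge step below still uses a temporary buffer, but the recursion itself no longer copies slices.

```python
def merge_sort_inplace(array, start=0, end=None):
    # sort array[start:end] in place; indexes replace slice copies
    if end is None:
        end = len(array)
    if end - start <= 1:
        return
    mid = (start + end) // 2
    merge_sort_inplace(array, start, mid)
    merge_sort_inplace(array, mid, end)
    # merge the sorted halves array[start:mid] and array[mid:end]
    merged = []
    i, j = start, mid
    while i < mid and j < end:
        if array[i] < array[j]:
            merged.append(array[i])
            i += 1
        else:
            merged.append(array[j])
            j += 1
    merged.extend(array[i:mid])
    merged.extend(array[j:end])
    array[start:end] = merged
```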
For starters: I cannot reproduce your timing results on 100 cycles and lists of size 10000. The exhaustive timeit benchmark of all implementations discussed in this answer (including bubblesort and your original snippet) is posted as a gist here. I find the following results for the average duration of a single run. Now, to make your function faster, you can do a few things.
Edit: Well, apparently I was wrong on that one (thanks cwillu): length computation takes O(1) in Python. But removing useless computation everywhere still improves things a bit (Original Mergesort: 0.224888720512, no-length Mergesort: 0.195795390606).
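The gist's code is not reproduced here; a plausible reconstruction of that "no-length" variant (computing len once per call instead of three times; the stdlib's heapq.merge stands in for the merge step just to keep the sketch short):

```python
from heapq import merge  # stdlib merge of two already-sorted iterables

def merge_sort_nolen(array):
    n = len(array)  # computed once per call instead of repeatedly
    if n <= 1:
        return array
    mid = n // 2
    return list(merge(merge_sort_nolen(array[:mid]),
                      merge_sort_nolen(array[mid:])))
```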
Second, as suggested in this answer, pop(0) is linear. Rewrite your merge to pop() at the end. This is again faster: no-len Mergesort: 0.195795390606, no-len Mergesort+fastmerge: 0.126505711079.
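A reconstruction of that "fastmerge" idea (the name is assumed): reverse both inputs once, then take elements with pop() from the end, which is O(1), instead of pop(0), which shifts every remaining element:

```python
def fastmerge(array1, array2):
    merged = []
    array1 = list(reversed(array1))  # smallest elements now at the end
    array2 = list(reversed(array2))
    while array1 or array2:
        if not array2 or (array1 and array1[-1] < array2[-1]):
            merged.append(array1.pop())  # O(1) pop from the end
        else:
            merged.append(array2.pop())
    return merged
```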
Third (and this would only be useful as-is if you were using a language that does tail call optimization; without it, it's a bad idea): your call to merge is not tail-recursive; it calls both (mergeSort left) and (mergeSort right) recursively while there is remaining work in the call (merge). But you can make it tail-recursive by using CPS (this will run out of stack size for even modest lists if you don't do TCO).
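A CPS sketch of the idea (a reconstruction, not the answerer's snippet): every step hands "the rest of the work" to a continuation, so each call sits syntactically in tail position. Python does no TCO, so this alone still grows the stack; it only pays off once combined with trampolining.

```python
from heapq import merge  # stdlib merge of two already-sorted iterables

def merge_sort_cps(array, cont=lambda res: res):
    if len(array) <= 1:
        return cont(array)
    mid = len(array) // 2
    # tail call: the pending merge is packed into the continuation
    return merge_sort_cps(
        array[:mid],
        lambda left: merge_sort_cps(
            array[mid:],
            lambda right: cont(list(merge(left, right)))))
```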
Once this is done, you can do TCO by hand to defer the call stack management done by recursion to the while loop of a normal function (trampolining, explained e.g. here, a trick originally due to Guy Steele). Trampolining and CPS work great together.
You write a thunking function that "records" and delays application: it takes a function and its arguments, and returns a function that, when called, applies that original function to those arguments.
You then write a trampoline that manages calls to thunks: it applies a thunk until the thunk returns a result (as opposed to another thunk).
Then all that's left is to "freeze" (thunk) all the recursive calls of the original CPS function, to let the trampoline unwrap them in proper sequence. Your function now returns a thunk, without recursion (and discarding its own frame), at every call.
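Put together, a minimal thunk/trampoline sketch of those three steps (my reconstruction; the names are mine, and heapq.merge stands in for the merge step):

```python
from heapq import merge  # stdlib merge of two already-sorted iterables

def thunk(f, *args):
    # "freeze" a call: return a zero-argument function that performs it later
    return lambda: f(*args)

def trampoline(t):
    # keep forcing thunks until one yields a non-callable result
    while callable(t):
        t = t()
    return t

def merge_sort_tramp(array, cont):
    # every return is a thunk, so each step runs in the trampoline's
    # loop with constant stack depth
    if len(array) <= 1:
        return thunk(cont, array)
    mid = len(array) // 2
    return thunk(merge_sort_tramp, array[:mid],
                 lambda left: thunk(merge_sort_tramp, array[mid:],
                                    lambda right: thunk(cont, list(merge(left, right)))))

def merge_sort(array):
    return trampoline(merge_sort_tramp(array, lambda res: res))
```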
Sadly this does not go that fast (recursive mergesort: 0.126505711079, this trampolined version: 0.170638551712). OK, I guess the stack blowup of the recursive merge sort algorithm is in fact modest: as soon as you get out of the leftmost path in the array-slicing recursion pattern, the algorithm starts returning (and removing frames). So for 10K-sized lists you get a function stack of depth at most log_2(10 000) = 14 ... pretty modest.
You can do a slightly more involved stack-based TCO elimination along the lines of what this SO answer gives.
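A sketch of that stack-based approach (my reconstruction, not the linked answer's code): replace the call stack with an explicit work stack of frames.

```python
from heapq import merge  # stdlib merge of two already-sorted iterables

def merge_sort_stack(array):
    # (slice, False) = "still to be split";
    # (None, True)  = "merge the two most recently finished results"
    work = [(list(array), False)]
    results = []
    while work:
        arr, ready = work.pop()
        if ready:
            right = results.pop()
            left = results.pop()
            results.append(list(merge(left, right)))
        elif len(arr) <= 1:
            results.append(arr)
        else:
            mid = len(arr) // 2
            work.append((None, True))        # merge once both halves are sorted
            work.append((arr[mid:], False))  # right half
            work.append((arr[:mid], False))  # left half (popped first)
    return results.pop()
```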
But this goes only a tad faster (trampolined mergesort: 0.170638551712, this stack-based version: 0.144994809628). Apparently, the stack-building that Python does at the recursive calls of our original merge sort is pretty inexpensive.
The final results? On my machine (Ubuntu natty's stock Python 2.7.1+), the average run timings (out of 100 runs, except for Bubblesort, on lists of size 10000 containing random integers of size 0-10000000) are:
Your merge-sort has a big constant factor; you have to run it on large lists to see the asymptotic complexity benefit.
Umm.. 1,000 records?? You are still well within the polynomial coefficient dominance here.. If I have
selection-sort: 15 * n ^ 2 (reads) + 5 * n^2 (swaps)
insertion-sort: 5 * n ^2 (reads) + 15 * n^2 (swaps)
merge-sort: 200 * n * log(n) (reads) + 1000 * n * log(n) (merges)
You're going to be in a close race for a long while.. By the way, 2x faster in sorting is NOTHING. Try 100x slower. That's where the real differences are felt. Try "won't finish in my life-time" algorithms (there are known regular expressions that take this long to match simple strings).
So try 1M or 1G records and let us know if you still think merge-sort isn't doing too well.
That being said..
There are lots of things causing this merge-sort to be expensive. First of all, nobody ever runs quick or merge sort on small-scale data structures. Where you have if (len <= 1), people generally put:
if (len <= 16): (use inline insertion-sort)
else: merge-sort
at EACH propagation level.
Since insertion-sort has a smaller coefficient cost at smaller sizes of n. Note that 50% of your work is done in this last mile.
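That hybrid could be sketched like this (a reconstruction of the commenter's suggestion; the cutoff of 16 is their example figure, not a tuned value, and heapq.merge stands in for the merge step):

```python
from heapq import merge  # stdlib merge of two already-sorted iterables

CUTOFF = 16  # below this size, insertion sort's small constant wins

def insertion_sort(array):
    # classic in-place insertion sort: O(n^2) but cheap for tiny n
    for i in range(1, len(array)):
        value = array[i]
        j = i - 1
        while j >= 0 and array[j] > value:
            array[j + 1] = array[j]
            j -= 1
        array[j + 1] = value
    return array

def hybrid_merge_sort(array):
    if len(array) <= CUTOFF:
        return insertion_sort(list(array))
    mid = len(array) // 2
    return list(merge(hybrid_merge_sort(array[:mid]),
                      hybrid_merge_sort(array[mid:])))
```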
Next, you are needlessly running array1.pop(0) instead of maintaining index counters. If you're lucky, Python is efficiently managing start-of-array offsets, but all else being equal, you're mutating input parameters.
Also, you know the size of the target array during the merge, so why copy-and-double merged_array repeatedly? Pre-allocate the target array at the start of the function. That'll save at least a dozen 'clones' per merge level.
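A sketch of a pre-allocating merge (name assumed; worth hedging that CPython's list.append already over-allocates geometrically, so in practice this trims a constant factor rather than eliminating repeated clones):

```python
def merge_prealloc(array1, array2):
    # allocate the destination once, then fill it by index
    merged = [None] * (len(array1) + len(array2))
    i = j = 0
    for k in range(len(merged)):
        # take from array1 while it has the smaller head (or array2 is spent)
        if j >= len(array2) or (i < len(array1) and array1[i] < array2[j]):
            merged[k] = array1[i]
            i += 1
        else:
            merged[k] = array2[j]
            j += 1
    return merged
```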
In general, merge-sort uses 2x the size of RAM.. Your algorithm is probably using 20x because of all the temporary merge buffers (hopefully Python can free structures before recursing). It breaks elegance, but generally the best merge-sort algorithms make an immediate allocation of a merge buffer equal to the size of the source array, and you perform complex address arithmetic (or array-index + span-length) to just keep merging data-structures back and forth. It won't be as elegant as a simple recursive solution like this, but it's somewhat close.
In C-sorting, cache-coherence is your biggest enemy. You want hot data-structures so you maximize your cache. By allocating transient temp buffers (even if the memory manager is returning pointers to hot memory) you run the risk of making slow DRAM calls (pre-filling cache-lines for data you're about to over-write). This is one advantage insertion-sort, selection-sort and quick-sort have over merge-sort (when implemented as above).
Speaking of which, something like quick-sort is both naturally elegant code and naturally efficient code, and doesn't waste any memory (google it on Wikipedia; they have a JavaScript implementation from which to base your code). Squeezing the last ounce of performance out of quick-sort is hard (especially in scripting languages, which is why they generally just use the C API to do that part), and you have a worst case of O(n^2). You can try to be clever by doing a combination bubble-sort/quick-sort to mitigate the worst case.
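For comparison, an in-place quicksort sketch (a Lomuto partition with a random pivot, which makes the O(n^2) worst case unlikely on real inputs; this is an illustration, not the Wikipedia implementation the comment refers to):

```python
import random

def quicksort(array, lo=0, hi=None):
    # sorts array[lo:hi + 1] in place, with no auxiliary buffers
    if hi is None:
        hi = len(array) - 1
    if lo >= hi:
        return array
    # random pivot, swapped to the end for the Lomuto partition scan
    pivot_index = random.randint(lo, hi)
    array[pivot_index], array[hi] = array[hi], array[pivot_index]
    pivot = array[hi]
    store = lo
    for k in range(lo, hi):
        if array[k] < pivot:
            array[k], array[store] = array[store], array[k]
            store += 1
    array[store], array[hi] = array[hi], array[store]
    quicksort(array, lo, store - 1)
    quicksort(array, store + 1, hi)
    return array
```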
Happy coding.