Why is my Python merge sort so slow?
I'm having some trouble understanding this behaviour. I'm measuring execution time with the timeit module and get the following results for 10000 cycles:
- Merge : 1.22722930395
- Bubble: 0.810706578175
- Select: 0.469924766812
This is my merge sort code:
def mergeSort(array):
    if len(array) <= 1:
        return array
    else:
        left = array[:len(array)/2]
        right = array[len(array)/2:]
        return merge(mergeSort(left), mergeSort(right))

def merge(array1, array2):
    merged_array = []
    while len(array1) > 0 or len(array2) > 0:
        if array2 and not array1:
            merged_array.append(array2.pop(0))
        elif (array1 and not array2) or array1[0] < array2[0]:
            merged_array.append(array1.pop(0))
        else:
            merged_array.append(array2.pop(0))
    return merged_array
Edit:
I've changed the list operations to use pointers, and my tests now use a list of 1000 random numbers from 0-1000. (By the way: I switched to only 10 cycles here.)
Results:
- Merge : 0.0574434420723
- Bubble: 1.74780097558
- Select: 0.362952293025
This is my rewritten merge definition:
def merge(array1, array2):
    merged_array = []
    pointer1, pointer2 = 0, 0
    while pointer1 < len(array1) and pointer2 < len(array2):
        if array1[pointer1] < array2[pointer2]:
            merged_array.append(array1[pointer1])
            pointer1 += 1
        else:
            merged_array.append(array2[pointer2])
            pointer2 += 1
    while pointer1 < len(array1):
        merged_array.append(array1[pointer1])
        pointer1 += 1
    while pointer2 < len(array2):
        merged_array.append(array2[pointer2])
        pointer2 += 1
    return merged_array
Now it seems to work just fine :)
Comments (4)
list.pop(0) pops the first element and has to shift all the remaining ones; that's an additional O(n) operation which must not happen. Also, slicing a list object creates a copy, which means you're also using O(n * log(n)) memory instead of O(n).
I can't see your bubble sort, but I bet it works in-place; no wonder it's faster.
You need to rewrite your merge sort to work in-place. Instead of copying parts of the original list, pass starting and ending indexes.
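A minimal sketch of that in-place idea (my reconstruction, not the commenter's code; the function name and signature are made up): sort subranges identified by start/end indexes instead of slicing out copies. The merge step below still uses a temporary buffer, but the recursion itself no longer copies slices.

```python
def merge_sort_inplace(array, start=0, end=None):
    # sort array[start:end] in place; indexes replace slice copies
    if end is None:
        end = len(array)
    if end - start <= 1:
        return
    mid = (start + end) // 2
    merge_sort_inplace(array, start, mid)
    merge_sort_inplace(array, mid, end)
    # merge the sorted halves array[start:mid] and array[mid:end]
    merged = []
    i, j = start, mid
    while i < mid and j < end:
        if array[i] < array[j]:
            merged.append(array[i])
            i += 1
        else:
            merged.append(array[j])
            j += 1
    merged.extend(array[i:mid])
    merged.extend(array[j:end])
    array[start:end] = merged
```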
For starters: I cannot reproduce your timing results on 100 cycles and lists of size 10000. The exhaustive timeit benchmark of all implementations discussed in this answer (including bubblesort and your original snippet) is posted as a gist here. I find the following results for the average duration of a single run. Now, to make your function faster, you can do a few things.
Edit: Well, apparently I was wrong on that one (thanks cwillu): length computation takes O(1) in Python. But removing useless computation everywhere still improves things a bit (Original Mergesort: 0.224888720512, no-length Mergesort: 0.195795390606).
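The gist's code is not reproduced here; a plausible reconstruction of that "no-length" variant (computing len once per call instead of three times; the stdlib's heapq.merge stands in for the merge step just to keep the sketch short):

```python
from heapq import merge  # stdlib merge of two already-sorted iterables

def merge_sort_nolen(array):
    n = len(array)  # computed once per call instead of repeatedly
    if n <= 1:
        return array
    mid = n // 2
    return list(merge(merge_sort_nolen(array[:mid]),
                      merge_sort_nolen(array[mid:])))
```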
Second, as suggested in this answer, pop(0) is linear. Rewrite your merge to pop() at the end. This is again faster: no-len Mergesort: 0.195795390606, no-len Mergesort+fastmerge: 0.126505711079.
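A reconstruction of that "fastmerge" idea (the name is assumed): reverse both inputs once, then take elements with pop() from the end, which is O(1), instead of pop(0), which shifts every remaining element:

```python
def fastmerge(array1, array2):
    merged = []
    array1 = list(reversed(array1))  # smallest elements now at the end
    array2 = list(reversed(array2))
    while array1 or array2:
        if not array2 or (array1 and array1[-1] < array2[-1]):
            merged.append(array1.pop())  # O(1) pop from the end
        else:
            merged.append(array2.pop())
    return merged
```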
Third (and this would only be useful as-is if you were using a language that does tail call optimization; without it, it's a bad idea): your call to merge is not tail-recursive; it calls both (mergeSort left) and (mergeSort right) recursively while there is remaining work in the call (merge). But you can make it tail-recursive by using CPS (this will run out of stack size for even modest lists if you don't do TCO).
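A CPS sketch of the idea (a reconstruction, not the answerer's snippet): every step hands "the rest of the work" to a continuation, so each call sits syntactically in tail position. Python does no TCO, so this alone still grows the stack; it only pays off once combined with trampolining.

```python
from heapq import merge  # stdlib merge of two already-sorted iterables

def merge_sort_cps(array, cont=lambda res: res):
    if len(array) <= 1:
        return cont(array)
    mid = len(array) // 2
    # tail call: the pending merge is packed into the continuation
    return merge_sort_cps(
        array[:mid],
        lambda left: merge_sort_cps(
            array[mid:],
            lambda right: cont(list(merge(left, right)))))
```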
Once this is done, you can do TCO by hand to defer the call stack management done by recursion to the while loop of a normal function (trampolining, explained e.g. here, a trick originally due to Guy Steele). Trampolining and CPS work great together.
You write a thunking function that "records" and delays application: it takes a function and its arguments, and returns a function that, when called, applies that original function to those arguments.
You then write a trampoline that manages calls to thunks: it applies a thunk until the thunk returns a result (as opposed to another thunk).
Then all that's left is to "freeze" (thunk) all the recursive calls of the original CPS function, to let the trampoline unwrap them in proper sequence. Your function now returns a thunk, without recursion (and discarding its own frame), at every call.
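Put together, a minimal thunk/trampoline sketch of those three steps (my reconstruction; the names are mine, and heapq.merge stands in for the merge step):

```python
from heapq import merge  # stdlib merge of two already-sorted iterables

def thunk(f, *args):
    # "freeze" a call: return a zero-argument function that performs it later
    return lambda: f(*args)

def trampoline(t):
    # keep forcing thunks until one yields a non-callable result
    while callable(t):
        t = t()
    return t

def merge_sort_tramp(array, cont):
    # every return is a thunk, so each step runs in the trampoline's
    # loop with constant stack depth
    if len(array) <= 1:
        return thunk(cont, array)
    mid = len(array) // 2
    return thunk(merge_sort_tramp, array[:mid],
                 lambda left: thunk(merge_sort_tramp, array[mid:],
                                    lambda right: thunk(cont, list(merge(left, right)))))

def merge_sort(array):
    return trampoline(merge_sort_tramp(array, lambda res: res))
```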
Sadly this does not go that fast (recursive mergesort: 0.126505711079, this trampolined version: 0.170638551712). OK, I guess the stack blowup of the recursive merge sort algorithm is in fact modest: as soon as you get out of the leftmost path in the array-slicing recursion pattern, the algorithm starts returning (and removing frames). So for 10K-sized lists you get a function stack of depth at most log_2(10 000) = 14 ... pretty modest.
You can do a slightly more involved stack-based TCO elimination along the lines of what this SO answer gives.
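A sketch of that stack-based approach (my reconstruction, not the linked answer's code): replace the call stack with an explicit work stack of frames.

```python
from heapq import merge  # stdlib merge of two already-sorted iterables

def merge_sort_stack(array):
    # (slice, False) = "still to be split";
    # (None, True)  = "merge the two most recently finished results"
    work = [(list(array), False)]
    results = []
    while work:
        arr, ready = work.pop()
        if ready:
            right = results.pop()
            left = results.pop()
            results.append(list(merge(left, right)))
        elif len(arr) <= 1:
            results.append(arr)
        else:
            mid = len(arr) // 2
            work.append((None, True))        # merge once both halves are sorted
            work.append((arr[mid:], False))  # right half
            work.append((arr[:mid], False))  # left half (popped first)
    return results.pop()
```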
But this goes only a tad faster (trampolined mergesort: 0.170638551712, this stack-based version: 0.144994809628). Apparently, the stack-building that Python does at the recursive calls of our original merge sort is pretty inexpensive.
The final results? On my machine (Ubuntu natty's stock Python 2.7.1+), the average run timings (out of 100 runs, except for Bubblesort, on lists of size 10000 containing random integers of size 0-10000000) are:
Your merge-sort has a big constant factor; you have to run it on large lists to see the asymptotic complexity benefit.
Umm.. 1,000 records?? You are still well within the polynomial coefficient dominance here.. If I have
selection-sort: 15 * n ^ 2 (reads) + 5 * n^2 (swaps)
insertion-sort: 5 * n ^2 (reads) + 15 * n^2 (swaps)
merge-sort: 200 * n * log(n) (reads) + 1000 * n * log(n) (merges)
You're going to be in a close race for a long while.. By the way, 2x faster in sorting is NOTHING. Try 100x slower. That's where the real differences are felt. Try "won't finish in my life-time" algorithms (there are known regular expressions that take this long to match simple strings).
So try 1M or 1G records and let us know if you still think merge-sort isn't doing too well.
That being said..
There are lots of things causing this merge-sort to be expensive. First of all, nobody ever runs quick or merge sort on small-scale data structures. Where you have if (len <= 1), people generally put:
if (len <= 16): (use inline insertion-sort)
else: merge-sort
at EACH propagation level.
Since insertion-sort has a smaller coefficient cost at smaller sizes of n. Note that 50% of your work is done in this last mile.
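That hybrid could be sketched like this (a reconstruction of the commenter's suggestion; the cutoff of 16 is their example figure, not a tuned value, and heapq.merge stands in for the merge step):

```python
from heapq import merge  # stdlib merge of two already-sorted iterables

CUTOFF = 16  # below this size, insertion sort's small constant wins

def insertion_sort(array):
    # classic in-place insertion sort: O(n^2) but cheap for tiny n
    for i in range(1, len(array)):
        value = array[i]
        j = i - 1
        while j >= 0 and array[j] > value:
            array[j + 1] = array[j]
            j -= 1
        array[j + 1] = value
    return array

def hybrid_merge_sort(array):
    if len(array) <= CUTOFF:
        return insertion_sort(list(array))
    mid = len(array) // 2
    return list(merge(hybrid_merge_sort(array[:mid]),
                      hybrid_merge_sort(array[mid:])))
```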
Next, you are needlessly running array1.pop(0) instead of maintaining index counters. If you're lucky, Python is efficiently managing start-of-array offsets, but all else being equal, you're mutating input parameters.
Also, you know the size of the target array during the merge, so why copy-and-double merged_array repeatedly? Pre-allocate the target array at the start of the function. That'll save at least a dozen 'clones' per merge level.
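A sketch of a pre-allocating merge (name assumed; worth hedging that CPython's list.append already over-allocates geometrically, so in practice this trims a constant factor rather than eliminating repeated clones):

```python
def merge_prealloc(array1, array2):
    # allocate the destination once, then fill it by index
    merged = [None] * (len(array1) + len(array2))
    i = j = 0
    for k in range(len(merged)):
        # take from array1 while it has the smaller head (or array2 is spent)
        if j >= len(array2) or (i < len(array1) and array1[i] < array2[j]):
            merged[k] = array1[i]
            i += 1
        else:
            merged[k] = array2[j]
            j += 1
    return merged
```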
In general, merge-sort uses 2x the size of RAM.. Your algorithm is probably using 20x because of all the temporary merge buffers (hopefully Python can free structures before recursing). It breaks elegance, but generally the best merge-sort algorithms make an immediate allocation of a merge buffer equal to the size of the source array, and you perform complex address arithmetic (or array-index + span-length) to just keep merging data-structures back and forth. It won't be as elegant as a simple recursive solution like this, but it's somewhat close.
In C-sorting, cache-coherence is your biggest enemy. You want hot data-structures so you maximize your cache. By allocating transient temp buffers (even if the memory manager is returning pointers to hot memory) you run the risk of making slow DRAM calls (pre-filling cache-lines for data you're about to over-write). This is one advantage insertion-sort, selection-sort and quick-sort have over merge-sort (when implemented as above).
Speaking of which, something like quick-sort is both naturally elegant code and naturally efficient code, and doesn't waste any memory (google it on Wikipedia; they have a JavaScript implementation from which to base your code). Squeezing the last ounce of performance out of quick-sort is hard (especially in scripting languages, which is why they generally just use the C API to do that part), and you have a worst case of O(n^2). You can try to be clever by doing a combination bubble-sort/quick-sort to mitigate the worst case.
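For comparison, an in-place quicksort sketch (a Lomuto partition with a random pivot, which makes the O(n^2) worst case unlikely on real inputs; this is an illustration, not the Wikipedia implementation the comment refers to):

```python
import random

def quicksort(array, lo=0, hi=None):
    # sorts array[lo:hi + 1] in place, with no auxiliary buffers
    if hi is None:
        hi = len(array) - 1
    if lo >= hi:
        return array
    # random pivot, swapped to the end for the Lomuto partition scan
    pivot_index = random.randint(lo, hi)
    array[pivot_index], array[hi] = array[hi], array[pivot_index]
    pivot = array[hi]
    store = lo
    for k in range(lo, hi):
        if array[k] < pivot:
            array[k], array[store] = array[store], array[k]
            store += 1
    array[store], array[hi] = array[hi], array[store]
    quicksort(array, lo, store - 1)
    quicksort(array, store + 1, hi)
    return array
```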
Happy coding.