No speedup from simple merge sort parallelization in Haskell

Published 2024-11-14 18:57:16


Note: This post was completely rewritten 2011-06-10; thanks to Peter for helping me out. Also, please don't be offended if I don't accept one answer, since this question seems to be rather open-ended. (But, if you solve it, you get the check mark, of course).

Another user had posted a question about parallelizing a merge sort. I thought I'd write a simple solution, but alas, it is not much faster than the sequential version.

Problem statement

Merge sort is a divide-and-conquer algorithm, where the leaves of computation can be parallelized.

[figure: mergesort]

The code works as follows: the list is converted into a tree representing computation nodes. Then, the merging step returns a list for each node. Theoretically, we should see some significant performance gains, since we're going from an O(n log n) algorithm to an O(n) algorithm with infinitely many processors.
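The O(n) claim can be made concrete with the span recurrence: with unbounded processors both halves sort simultaneously, and only the merge at each level is sequential. A back-of-the-envelope sketch (the span' helper and unit costs are my own illustration, not part of the post):

```haskell
-- Critical-path ("span") length of the merge tree when every pair of
-- sub-sorts runs in parallel: S(n) = S(n/2) + n for the sequential
-- merge, which telescopes to n + n/2 + n/4 + ... < 2n, i.e. O(n).
span' :: Int -> Int
span' n | n <= 1    = 1
        | otherwise = span' (n `div` 2) + n
```

For example, span' 8 = 8 + 4 + 2 + 1 = 15, about 2n, compared with the O(n log n) total work of the sequential sort.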

The first steps of the computation are parallelized when the parameter l (level) is greater than zero below. This is done by selecting the rpar strategy (via the variable strat), which makes the sub-computation mergeSort' x occur in parallel with mergeSort' y. Then, we merge the results and force their evaluation with rdeepseq.

import Control.Arrow ((***))
import Control.DeepSeq (NFData, deepseq)
import Control.Parallel.Strategies (Eval, rdeepseq, rpar, rseq, runEval)

data Tree a = Leaf a | Node (Tree a) (Tree a) deriving (Show)

instance NFData a => NFData (Tree a) where
    rnf (Leaf v) = deepseq v ()
    rnf (Node x y) = deepseq (x, y) ()

listToTree [] = error "listToTree -- empty list"
listToTree [x] = Leaf x
listToTree xs = uncurry Node $ listToTree *** listToTree $
    splitAt (length xs `div` 2) xs

-- mergeSort' :: (Ord a, NFData a) => Int -> Tree a -> Eval [a]
mergeSort' l (Leaf v) = return [v]
mergeSort' l (Node x y) = do
    xr <- strat $ runEval $ mergeSort' (l - 1) x
    yr <- rseq $ runEval $ mergeSort' (l - 1) y
    rdeepseq (merge xr yr)
    where
        merge [] y = y
        merge x [] = x
        merge (x:xs) (y:ys) | x < y = x : merge xs (y:ys)
                            | otherwise = y : merge (x:xs) ys
        strat | l > 0 = rpar
              | otherwise = rseq

mergeSort = runEval . mergeSort' 10

By sparking only a few levels of the computation, we should have decent parallel communication complexity as well -- on the order of n times a small constant.
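The level cap also bounds how many sparks get created: only Nodes reached with l > 0 spark their left child, and a balanced tree has at most 2^l - 1 such nodes. A throwaway helper to make that concrete (sparkBound is my own name, not from the code above):

```haskell
-- Nodes in the top l levels of a balanced tree:
-- 1 + 2 + ... + 2^(l-1) = 2^l - 1, so at most that many sparks.
-- With l = 10, as in mergeSort above, that is at most 1023 sparks.
sparkBound :: Int -> Int
sparkBound l | l <= 0    = 0
             | otherwise = 2 ^ l - 1
```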

Results

Obtain the fourth version of the source code here [ http://pastebin.com/DxYneAaC ]. Build and run it with the commands below to inspect thread usage, or adjust the RTS flags on subsequent command lines for benchmarking,

rm -f ParallelMergeSort; ghc -O2 -O3 -optc-O3 -optc-ffast-math -eventlog --make -rtsopts -threaded ParallelMergeSort.hs
./ParallelMergeSort +RTS -H512m -K512m -ls -N
threadscope ParallelMergeSort.eventlog

Results on a 24-core X5680 @ 3.33GHz show little improvement:

> ./ParallelMergeSort 
initialization: 10.461204s sec.
sorting: 6.383197s sec.
> ./ParallelMergeSort +RTS -H512m -K512m -N
initialization: 27.94877s sec.
sorting: 5.228463s sec.

and on my own machine, a quad-core Phenom II,

> ./ParallelMergeSort 
initialization: 18.943919s sec.
sorting: 10.465077s sec.
> ./ParallelMergeSort +RTS -H512m -K512m -ls -N
initialization: 22.92075s sec.
sorting: 7.431716s sec.

Inspecting the result in threadscope shows good utilization for small inputs (though, sadly, no perceptible speedup). However, when I try to run it on larger lists, like the one above, it uses about 2 CPUs half the time. It seems like a lot of sparks are getting pruned. It's also sensitive to the memory parameters: 256 MB is the sweet spot, while 128 MB gives 9 seconds, 512 MB gives 8.4, and 1024 MB gives 12.3!

Solutions I'm looking for

Finally, if anyone knows some high-power tools to throw at this, I'd appreciate it. (Eden?). My primary interest in Haskell parallelism is to be able to write small supportive tools for research projects, which I can throw on a 24 or 80 core server in our lab's cluster. Since they're not the main point of our group's research, I don't want to spend much time on the parallelization efficiency. So, for me, simpler is better, even if I only end up getting 20% usage.

Further discussion

  • I notice that the second bar in threadscope is sometimes green (cf. its homepage, where the second bar always seems to be garbage collection). What does this mean?
  • Is there any way to sidestep garbage collection? It seems to be taking a lot of time. For example, why can't a subcomputation be forked, return the result in shared memory, and then die?
  • Is there a better way (arrows, applicative) to express parallelism?


那伤。 2024-11-21 18:57:16


The answer is pretty simple: you have at no point introduced parallelism. Eval is just a monad for ordering computations; you have to ask for things to be executed in parallel manually. What you probably want is:

do xr <- rpar $ runEval $ mergeSort' x
   yr <- rseq $ runEval $ mergeSort' y
   rseq (merge xr yr)

This will make Haskell actually create a spark for the first computation, instead of trying to evaluate it on the spot.

Standard tips also kind-of apply:

  1. The result should be evaluated deeply (e.g. using evalTraversable rseq). Otherwise you will only force the head of the tree, and the bulk of the data will just be returned unevaluated.
  2. Just sparking everything will most likely eat up any gains. It would be a good idea to introduce a parameter that stops sparking at lower recursion levels.
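A sketch of both tips together, using only par/pseq from base and force from deepseq (parMerge is a hypothetical stand-in of mine, not the asker's code): spark only down to a given depth, and deeply force each half so a converted spark does the full sort rather than stopping at the first cons-cell:

```haskell
import Control.DeepSeq (NFData, force)
import GHC.Conc (par, pseq)  -- both exported from base

parMerge :: (Ord a, NFData a) => Int -> [a] -> [a]
parMerge _ []  = []
parMerge _ [x] = [x]
parMerge d xs =
  let (l, r) = splitAt (length xs `div` 2) xs
      l'     = force (parMerge (d - 1) l)  -- tip 1: evaluate deeply
      r'     = force (parMerge (d - 1) r)
  in if d > 0
       then l' `par` (r' `pseq` merge l' r')  -- spark the left half
       else merge l' r'                       -- tip 2: stop sparking
  where
    merge [] b = b
    merge a [] = a
    merge (a:as) (b:bs) | a < b     = a : merge as (b:bs)
                        | otherwise = b : merge (a:as) bs
```

Called with a small depth (say, the number of capabilities), this creates at most about 2^d sparks regardless of list length.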

Edit: The following actually doesn't apply anymore after the question edit

But the worst part comes last: your algorithm, as you state it, is very flawed. Your top-level seq only forces the first cons-cell of the list, which allows GHC to use laziness to great effect. It will never actually construct the result list, just plow through all of them in a search for the minimum element (that's not even strictly needed, but GHC only produces the cell after the minimum is known).

So don't be surprised when performance actually drops sharply once you start introducing parallelism under the assumption that you need the whole list at some point in the program...
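That first-cons-cell behaviour is easy to demonstrate with base alone (an illustrative snippet of mine, not code from the post): seq evaluates a list only to its outermost (:) constructor, so even an undefined element goes unnoticed:

```haskell
-- seq forces only weak head normal form: the first (:) cell.
-- length then walks the spine without touching elements, so the
-- undefined head is never evaluated.
headOnlyDemo :: Int
headOnlyDemo =
  let xs = (undefined :: Int) : [2, 3]
  in xs `seq` length xs
```

headOnlyDemo returns 3; replacing length with sum would force the elements and hit the undefined.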

Edit 2: Some more answers to the edits

The biggest problem with your program is probably that it is using lists. If you want to make more than a toy example, consider at least using (unboxed) arrays. If you want to go into serious number-crunching, maybe consider a specialised library like repa.

On "Further Discussion":

  • The colors stand for different GC states; I can't remember which. Try looking at the event log for the associated event.

  • The way to "sidestep" garbage collection is to not produce so much garbage in the first place, e.g. by using better data structures.

  • Well, if you are looking for an inspiration on robust parallelization it might be worthwhile to have a look at monad-par, which is relatively new but (I feel) less "surprising" in its parallel behaviour.

With monad-par, your example might become something like:

  do xr <- spawn $ mergeSort' x
     yr <- spawn $ mergeSort' y
     merge <$> get xr <*> get yr

So here the get actually forces you to specify the join points - and the library does the required deepseq automatically behind the scenes.

任性一次 2024-11-21 18:57:16

I had similar luck to what you report in EDIT 3 on a dual-core system with these variants. I used a smaller list length because I'm on a smaller computer, compiled with ghc -O2 -rtsopts -threaded MergePar.hs, and ran with ./MergePar +RTS -H256M -N. This might offer a more structured way to compare performance. Note that the RTS option -qa sometimes helps the simple par variants.

import Control.Applicative
import Control.Parallel
import Control.Parallel.Strategies
import Criterion.Main
import GHC.Conc (numCapabilities)

data Tree a = Leaf a | Node (Tree a) (Tree a) deriving Show

listToTree [] = error "listToTree -- empty list"
listToTree [x] = Leaf x
listToTree xs = Node (listToTree (take half xs)) (listToTree (drop half xs))
  where half = length xs `div` 2

-- Merge two ordered lists
merge :: Ord a => [a] -> [a] -> [a]
merge [] y = y
merge x [] = x
merge (x:xs) (y:ys) | x < y = x : merge xs (y:ys)
                    | otherwise = y : merge (x:xs) ys

-- Simple merge sort
mergeSort' :: Ord a => Tree a -> [a]
mergeSort' (Leaf v) = [v]
mergeSort' (Node x y) = merge (mergeSort' x) (mergeSort' y)

mergeSort :: Ord a => [a] -> [a]
mergeSort = mergeSort' . listToTree

-- Merge sort with 'par' annotations on every recursive call
mergeSortP' :: Ord a => Tree a -> [a]
mergeSortP' (Leaf v) = [v]
mergeSortP' (Node x y) = let xr = mergeSortP' x
                             yr = mergeSortP' y
                         in xr `par` yr `pseq` merge xr yr

mergeSortP :: Ord a => [a] -> [a]
mergeSortP = mergeSortP' . listToTree

-- Merge sort with 'rpar' annotations on every recursive call
mergeSortR' :: Ord a => Tree a -> [a]
mergeSortR' (Leaf v) = [v]
mergeSortR' (Node x y) = 
  runEval $ merge <$> rpar (mergeSortR' x) <*> rpar (mergeSortR' y)

mergeSortR :: Ord a => [a] -> [a]
mergeSortR = mergeSortR' . listToTree

-- Parallel merge sort that stops looking for parallelism at a certain
-- depth
smartMerge' :: Ord a => Int -> Tree a -> [a]
smartMerge' _ (Leaf v) = [v]
smartMerge' n t@(Node x y)
  | n <= 1 = mergeSort' t
  | otherwise = let xr = smartMerge' (n-1) x
                    yr = smartMerge' (n-2) y
                in xr `par` yr `pseq` merge xr yr

smartMerge :: Ord a => [a] -> [a]
smartMerge = smartMerge' numCapabilities . listToTree

main = defaultMain $ [ bench "original" $ nf mergeSort lst
                     , bench "par" $ nf mergeSortP lst
                     , bench "rpar" $ nf mergeSortR lst
                     , bench "smart" $ nf smartMerge lst ]
  where lst = [100000,99999..0] :: [Int]