Note: This post was completely rewritten 2011-06-10; thanks to Peter for helping me out. Also, please don't be offended if I don't accept one answer, since this question seems to be rather open-ended. (But, if you solve it, you get the check mark, of course).
Another user had posted a question about parallelizing a merge sort. I thought I'd write a simple solution, but alas, it is not much faster than the sequential version.
Problem statement
Merge sort is a divide-and-conquer algorithm, where the leaves of computation can be parallelized.
The code works as follows: the list is converted into a tree, representing computation nodes. Then, the merging step returns a list for each node. Theoretically, we should see significant performance gains, since we're going from an O(n log n) algorithm to an O(n) algorithm with infinite processors.
The first levels of the computation are parallelized: when the parameter l (level) is greater than zero, the variable strat selects the rpar strategy, which makes the sub-computation mergeSort' x run in parallel with mergeSort' y. We then merge the results and force their evaluation with rdeepseq.
data Tree a = Leaf a | Node (Tree a) (Tree a) deriving (Show)

instance NFData a => NFData (Tree a) where
    rnf (Leaf v)   = deepseq v ()
    rnf (Node x y) = deepseq (x, y) ()

listToTree []  = error "listToTree -- empty list"
listToTree [x] = Leaf x
listToTree xs  = uncurry Node $ listToTree *** listToTree $
                     splitAt (length xs `div` 2) xs

-- mergeSort' :: Ord a => Int -> Tree a -> Eval [a]
mergeSort' l (Leaf v)   = return [v]
mergeSort' l (Node x y) = do
    xr <- strat $ runEval $ mergeSort' (l - 1) x
    yr <- rseq  $ runEval $ mergeSort' (l - 1) y
    rdeepseq (merge xr yr)
  where
    merge [] y = y
    merge x [] = x
    merge (x:xs) (y:ys) | x < y     = x : merge xs (y:ys)
                        | otherwise = y : merge (x:xs) ys
    strat | l > 0     = rpar
          | otherwise = rseq

mergeSort = runEval . mergeSort' 10
By only evaluating a few levels of the computation, we should have decent parallel communication complexity as well -- on the order of n, up to a constant factor.
Results
Obtain the fourth version of the source code here [ http://pastebin.com/DxYneAaC ], then build and run it as follows to inspect thread usage (for plain benchmarking, drop the -ls flag and the threadscope step):
rm -f ParallelMergeSort; ghc -O2 -O3 -optc-O3 -optc-ffast-math -eventlog --make -rtsopts -threaded ParallelMergeSort.hs
./ParallelMergeSort +RTS -H512m -K512m -ls -N
threadscope ParallelMergeSort.eventlog
Results on a 24-core X5680 @ 3.33GHz show little improvement:
> ./ParallelMergeSort
initialization: 10.461204s sec.
sorting: 6.383197s sec.
> ./ParallelMergeSort +RTS -H512m -K512m -N
initialization: 27.94877s sec.
sorting: 5.228463s sec.
and on my own machine, a quad-core Phenom II,
> ./ParallelMergeSort
initialization: 18.943919s sec.
sorting: 10.465077s sec.
> ./ParallelMergeSort +RTS -H512m -K512m -ls -N
initialization: 22.92075s sec.
sorting: 7.431716s sec.
Inspecting the result in threadscope shows good utilization for small amounts of data (though, sadly, no perceptible speedup). However, when I try to run it on larger lists, like the above, it uses about 2 CPUs half the time. It seems like a lot of sparks are getting pruned. It's also sensitive to the memory parameters, where 256 MB is the sweet spot; 128 MB gives 9 seconds, 512 MB gives 8.4, and 1024 MB gives 12.3!
Solutions I'm looking for
Finally, if anyone knows some high-power tools to throw at this, I'd appreciate it. (Eden?). My primary interest in Haskell parallelism is to be able to write small supportive tools for research projects, which I can throw on a 24 or 80 core server in our lab's cluster. Since they're not the main point of our group's research, I don't want to spend much time on the parallelization efficiency. So, for me, simpler is better, even if I only end up getting 20% usage.
Further discussion
- I notice that the second bar in threadscope is sometimes green (cf. its homepage, where the second bar seems to always be garbage collection). What does this mean?
- Is there any way to sidestep garbage collection? It seems to be taking a lot of time. For example, why can't a subcomputation be forked, return the result in shared memory, and then die?
- Is there a better way (arrows, applicative) to express parallelism?
The answer is pretty easy: because you have at no point introduced parallelism. Eval is just a monad for ordering computations; you have to ask for things to be executed in parallel manually. What you probably want is rpar on the first sub-computation: this will make Haskell actually create a spark for it, instead of trying to evaluate it on the spot.
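The inline code is missing from this copy of the answer. Given the surrounding text, it presumably showed sparking the first branch with rpar; the following is a reconstruction (my assumption, not the original snippet), reusing the question's Tree type and merge helper:

```haskell
import Control.Parallel.Strategies (Eval, runEval, rpar, rseq)

-- Reconstruction (assumed): spark the left sub-sort with rpar so it can
-- run on another capability, while the right sub-sort is evaluated on
-- the spot with rseq.
mergeSortPar :: Ord a => Tree a -> Eval [a]
mergeSortPar (Leaf v)   = return [v]
mergeSortPar (Node x y) = do
    xr <- rpar $ runEval (mergeSortPar x)  -- creates a spark
    yr <- rseq $ runEval (mergeSortPar y)  -- evaluated immediately
    return (merge xr yr)
  where
    merge [] ys = ys
    merge xs [] = xs
    merge (x:xs) (y:ys)
        | x < y     = x : merge xs (y:ys)
        | otherwise = y : merge (x:xs) ys
```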
Standard tips also kind-of apply: make the spark force its complete result, e.g. with rdeepseq (or evalTraversable rseq). Otherwise you will only force the head of the tree, and the bulk of the data will just be returned unevaluated.

Edit: The following actually doesn't apply anymore after the question edit
But the worst part last: your algorithm as you state it is very flawed. Your top-level seq only forces the first cons-cell of the list, which allows GHC to use laziness to great effect. It will never actually construct the result list, just plow through all of them in a search for the minimum element (that's not even strictly needed, but GHC only produces the cell after the minimum is known). So don't be surprised when performance actually drops sharply once you start introducing parallelism under the assumption that you need the whole list at some point in the program...
Edit 2: Some more answers to the edits
The biggest problem with your program is probably that it is using lists. If you want to make more than a toy example, consider at least using (unpacked) Arrays. If you want to go into serious number-crunching, maybe consider a specialised library like repa.
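As a hedged illustration of that tip (the vector and vector-algorithms packages are my choice here, not something the answer names):

```haskell
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Algorithms.Intro as Intro

-- Sorting an unboxed vector in place: the elements live in one flat
-- buffer, so there is none of the per-cons allocation and GC pressure
-- that a plain [Double] incurs.
sortUnboxed :: V.Vector Double -> V.Vector Double
sortUnboxed = V.modify Intro.sort
```

V.modify runs the destructive sort on a fresh mutable copy and freezes it, so the interface stays pure.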
On "Further Discussion":
The colors stand for different GC states, I can't remember which. Try to look at the event log for the associated event.
The way to "sidestep" garbage collection is to not produce so much garbage in the first place, e.g. by using better data structures.
Well, if you are looking for an inspiration on robust parallelization it might be worthwhile to have a look at monad-par, which is relatively new but (I feel) less "surprising" in its parallel behaviour.
With monad-par, your example might become something like:
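The code block is missing from this copy of the answer. Based on the description of spawn/get below and the question's Tree type, a plausible sketch (a reconstruction, not the original answer's code) is:

```haskell
import Control.Monad.Par (Par, runPar, spawn, get)
import Control.DeepSeq (NFData)

-- Assumes the Tree type from the question. spawn forks a sub-computation
-- and deepseqs its result into an IVar behind the scenes; get is the
-- explicit join point that waits for that result.
parMergeSort :: (Ord a, NFData a) => Tree a -> Par [a]
parMergeSort (Leaf v)     = return [v]
parMergeSort (Node lt rt) = do
    xr <- spawn (parMergeSort lt)
    yr <- spawn (parMergeSort rt)
    xs <- get xr
    ys <- get yr
    return (merge xs ys)
  where
    merge [] ys = ys
    merge xs [] = xs
    merge (x:xs) (y:ys)
        | x < y     = x : merge xs (y:ys)
        | otherwise = y : merge (x:xs) ys

parSort :: (Ord a, NFData a) => Tree a -> [a]
parSort = runPar . parMergeSort
```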
So here the get actually forces you to specify the join points - and the library does the required deepseq automatically behind the scenes.
变体。I had similar luck to what you report in EDIT 3 on a dual core system with these variants. I used a smaller list length because I'm on a smaller computer, compiled with
ghc -O2 -rtsopts -threaded MergePar.hs
, and ran with./MergePar +RTS -H256M -N
. This might offer a more structured way to compare performance. Note that the RTS option-qa
sometimes helps the simplepar
variants.