Why does my parallel code run so poorly?

Posted on 2025-02-09 15:16:32

I've been trying to parallelize my Haskell code and it has just been getting slower, so I made some sample code to show my problem. Here is the serial code:

module Main where

import System.Environment

sumRangeSquares :: (Num a, Enum a) => a -> a -> a
sumRangeSquares start end = sum $ map (^2) [start .. end]

main :: IO ()
main = do
    [start, end] <- map read <$> getArgs
    print $ sumRangeSquares start end

Compiled with stack ghc -- -O2 -rtsopts -eventlog -threaded src/Main.hs and run with time ./src/Main 1 10000000, it completes in about 0.4 seconds.

Now the obvious parallel counterpart is:

module Main where

import Control.Parallel.Strategies
import System.Environment

sumRangeSquares :: (Num a, Enum a) => a -> a -> a
sumRangeSquares start end = sum $ parMap rseq (^2) [start .. end]

main :: IO ()
main = do
    [start, end] <- map read <$> getArgs
    print $ sumRangeSquares start end

Compiled the same way and run with time ./src/Main 1 10000000 +RTS -N4 -lf -s, it takes over 6 seconds.

Here's the log created by -s:

   2,661,959,552 bytes allocated in the heap
   1,891,228,032 bytes copied during GC
     468,753,512 bytes maximum residency (12 sample(s))
     307,102,616 bytes maximum slop
            1226 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1837 colls,  1837 par   10.483s   2.705s     0.0015s    0.0080s
  Gen  1        12 colls,    11 par    5.157s   1.391s     0.1159s    0.5573s

  Parallel GC work balance: 26.09% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 10000000 (9998153 converted, 1847 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.038s  (  0.038s elapsed)
  MUT     time    6.995s  (  2.158s elapsed)
  GC      time   15.639s  (  4.096s elapsed)
  EXIT    time    0.001s  (  0.005s elapsed)
  Total   time   22.673s  (  6.297s elapsed)

  Alloc rate    380,577,209 bytes per MUT second

  Productivity  30.8% of total user, 34.3% of total elapsed


real    0m6.374s
user    0m16.889s
sys 0m5.859s

And here is the event log as seen in threadscope Main.eventlog:
[Image: event log of the parallel code]

As shown in the image, there is a lot of idle time, and all four HECs run and go idle at roughly the same times. There are also many long idle periods, and the spark pools and spark creation are unbalanced.

Comments (1)

丢了幸福的猪 2025-02-16 15:16:32

Creating a unit of parallel work is expensive compared with a single multiplication, and parMap requests one for every list element: the -s output reports 10,000,000 sparks. Squaring an integer costs much less than creating and scheduling a spark, so your program spends its time managing sparks and collecting garbage instead of doing useful work.

To get a speedup from multiple cores, give them a small number of expensive jobs rather than millions of tiny ones.
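One way to coarsen the work without changing the structure of the original code is to spark one chunk of list elements at a time. The sketch below is only an illustration and was not measured here. It uses parListChunk from Control.Parallel.Strategies; the function name sumRangeSquaresChunked and the chunk size of 100000 are assumptions chosen for the example. The whole list is still built in memory, so this variant may well be slower than splitting the range itself.

```haskell
module Main where

import Control.Parallel.Strategies (parListChunk, rseq, using)

-- Hypothetical variant of sumRangeSquares: the squares are evaluated in
-- parallel with one spark per chunk of `chunkSize` elements, instead of
-- one spark per element.
sumRangeSquaresChunked :: Int -> Integer -> Integer -> Integer
sumRangeSquaresChunked chunkSize start end =
    sum (map (^ 2) [start .. end] `using` parListChunk chunkSize rseq)

main :: IO ()
main = print (sumRangeSquaresChunked 100000 1 10000000)
```

With 10,000,000 elements and a chunk size of 100000, this creates 100 sparks rather than 10,000,000. It needs the parallel package and the -threaded flag, as in the question.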

The following example is maybe awkward, but sufficient: we keep sumRangeSquares essentially the same as in the sequential variant, split the range into 4 pieces, evaluate sumRangeSquares on each piece in parallel, and then sum the 4 outputs for the final result.

module Main where

import Control.Parallel.Strategies
import System.Environment

sumRangeSquares :: (Integer, Integer) -> Integer
sumRangeSquares (start, end) = sum $ map (^2) [start .. end]

main :: IO ()
main = do
    [start, end] <- map (read :: (String -> Integer)) <$> getArgs
    let step = (end - start) `div` 4
        -- Ranges are inclusive, so each chunk starts one past the previous chunk's end.
        bounds = [start + i * step | i <- [1..3]]
        space = zip (start : map (+1) bounds) (bounds ++ [end])
    print $ sum $ parMap rseq sumRangeSquares space
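Because [a .. b] includes both endpoints, the chunk boundaries need some care so that no number is counted twice or skipped. A minimal sketch of a general splitter follows; splitRange is a hypothetical helper, not part of the original answer, and it uses only base so the boundaries can be checked on their own.

```haskell
module Main where

-- Split the inclusive range (start, end) into at most n inclusive,
-- non-overlapping sub-ranges that together cover every integer once.
splitRange :: Integer -> (Integer, Integer) -> [(Integer, Integer)]
splitRange n (start, end)
    | n <= 0 || end < start = []
    | otherwise =
        [ (lo, hi)
        | i <- [0 .. n - 1]
        , let lo = start + i * size
              hi = min end (lo + size - 1)
        , lo <= end
        ]
  where
    total = end - start + 1
    size  = (total + n - 1) `div` n   -- ceiling division

main :: IO ()
main = do
    let chunks = splitRange 4 (1, 10)
    print chunks
    -- Check: summing chunk by chunk matches summing the whole range.
    print (sum [sum (map (^ 2) [lo .. hi]) | (lo, hi) <- chunks]
           == sum (map (^ 2) [1 .. 10 :: Integer]))
```

The resulting list could be passed to parMap rseq sumRangeSquares in place of a hand-written list of pairs, and the number of pieces could follow the -N setting rather than being fixed at 4.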

I used 1 and 30000000 as arguments to get a more significant result. This is the timing for your sequential variant:

time ./app/Main 1 30000000

real    0m1,353s
user    0m1,350s
sys     0m0,004s

This is my parallel version, run with one thread:

time ./app/Main 1 30000000 +RTS -N1 -lf

real    0m1,334s
user    0m1,311s
sys     0m0,022s

This is my parallel version, run with four threads:

time ./app/Main 1 30000000 +RTS -N4 -lf

real    0m0,416s
user    0m1,386s
sys     0m0,024s
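
To put those numbers in perspective, the wall-clock ("real") times correspond to a speedup of roughly 3.25x on four cores, or about 81% parallel efficiency. The arithmetic, as a small sketch using only the timings quoted above:

```haskell
module Main where

-- Wall-clock ("real") times in seconds, copied from the runs above.
sequentialTime, parallelTime :: Double
sequentialTime = 1.353
parallelTime   = 0.416

-- Speedup: how many times faster the four-thread run finished.
speedup :: Double
speedup = sequentialTime / parallelTime

-- Efficiency: speedup divided by the number of cores (-N4).
efficiency :: Double
efficiency = speedup / 4

main :: IO ()
main = do
    putStrLn ("speedup:    " ++ show speedup)
    putStrLn ("efficiency: " ++ show efficiency)
```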