Best way of "looping over a 2-D array" using Repa

Posted 2024-11-05 15:45:06

I find the array library Repa for Haskell very interesting, and wanted to make a simple program, to try to understand how to use it. I also made a simple implementation using lists, which proved to be much faster. My main question is how I could improve the Repa code below to make it the most efficient (and hopefully also very readable). I am quite new using Haskell, and I couldn't find any easily understandable tutorial on Repa [edit there is one at the Haskell Wiki, that I somehow forgot when I wrote this], so don't assume I know anything. :) For example, I'm not sure when to use force or deepSeqArray.

The program is used to approximately calculate the volume of a sphere in the following way:

The center point and radius of the sphere are specified, as well as equally spaced coordinates within a cuboid that is known to encompass the sphere.
The program takes each of the coordinates and calculates its distance to the center of the sphere; if the distance is smaller than the radius of the sphere, the coordinate is counted toward the total (approximate) volume of the sphere.
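The two steps above can be sketched in a few lines using only base. This is a minimal illustration, not the code from the question: the names `axis`, `insideSphere`, `countInside` and `approxVolume` are hypothetical, and the radius and centre are passed as arguments instead of being top-level constants.

```haskell
-- Hypothetical helper names; only base is used.

-- | n equally spaced grid values starting at lo. Each value is computed
-- from its integer index, which avoids accumulating floating-point error.
axis :: Double -> Double -> Int -> [Double]
axis lo step n = [lo + step * fromIntegral i | i <- [0 .. n - 1]]

-- | True if the point lies strictly inside the sphere of radius r centred at c.
insideSphere :: Double -> (Double, Double, Double) -> (Double, Double, Double) -> Bool
insideSphere r (xc, yc, zc) (x, y, z) =
  sq (x - xc) + sq (y - yc) + sq (z - zc) < sq r
  where
    sq v = v * v

-- | Number of grid points (taken from the same axis in x, y and z)
-- that fall inside the sphere.
countInside :: Double -> (Double, Double, Double) -> [Double] -> Int
countInside r c ax =
  length [() | x <- ax, y <- ax, z <- ax, insideSphere r c (x, y, z)]

-- | Approximate volume: one cube of side `step` per grid point inside.
approxVolume :: Double -> Double -> (Double, Double, Double) -> [Double] -> Double
approxVolume step r c ax = step * step * step * fromIntegral (countInside r c ax)
```

With the values from the question, the call would be `approxVolume 0.1 4 (0,0,0) (axis (-10) 0.1 201)`.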

Two versions are shown below, one using lists and one using repa. I know the code is inefficient, especially for this use case, but the idea is to make it more complicated later on.

For the values below, and compiling with "ghc -Odph -fllvm -fforce-recomp -rtsopts -threaded", the list version takes 1.4 s, while the repa version takes 12 s with +RTS -N1 and 10 s with +RTS -N2, though no sparks are converted (I have a dual-core Intel machine (Core 2 Duo E7400 @ 2.8 GHz) running Windows 7 64, GHC 7.0.2 and llvm 2.8). (Comment out the correct line in main below to just run one of the versions.)

Thank you for any help!

import Data.Array.Repa as R
import qualified Data.Vector.Unboxed as V
import Prelude as P

-- Calculate the volume of a sphere by putting it in a bath of coordinates. Generate coordinates (x,y,z) in a cuboid. Then, for each coordinate, check if it is inside the sphere. Sum those coordinates and multiply by the coordinate grid step size to find an approximate volume.


particles = [(0,0,0)] -- used for the list alternative --[(0,0,0),(0,2,0)]
particles_repa = [0,0,0::Double] -- used for the repa alternative, can currently just be one coordinate

-- Radius of the particle
a = 4

-- Generate the coordinates. Could this be done more efficiently, and at the same time simple? In Matlab I would use ndgrid.
step = 0.1 --0.05
xrange = [-10,-10+step..10] :: [Double]
yrange = [-10,-10+step..10]
zrange = [-10,-10+step..10]

-- All coordinates as triples. These are used directly in the list version below.
coords = [(x,y,z)  | x <- xrange, y <- yrange, z <- zrange]

---- List code ----

volumeIndividuals = fromIntegral (length particles) * 4*pi*a**3/3

volumeInside = step**3 * fromIntegral (numberInsideParticles particles coords)

numberInsideParticles particles coords = length $ filter (==True) $ P.map (insideParticles particles) coords

insideParticles particles coord =  any (==True) $ P.map (insideParticle coord) particles

insideParticle (xc,yc,zc) (xp,yp,zp) = ((xc-xp)^2+(yc-yp)^2+(zc-zp)^2) < a**2
---- End list code ----

---- Repa code ----

-- Put the coordinates in a Nx3 array.
xcoords = P.map (\(x,_,_) -> x) coords
ycoords = P.map (\(_,y,_) -> y) coords
zcoords = P.map (\(_,_,z) -> z) coords

-- Total number of coordinates
num_coords = (length xcoords) ::Int

xcoords_r = fromList (Z :. num_coords :. (1::Int)) xcoords
ycoords_r = fromList (Z :. num_coords :. (1::Int)) ycoords
zcoords_r = fromList (Z :. num_coords :. (1::Int)) zcoords

rcoords = xcoords_r R.++ ycoords_r R.++ zcoords_r

-- Put the particle coordinates in an array, then extend (replicate) this array so that its size becomes the same as that of rcoords
particle = fromList (Z :. (1::Int) :. (3::Int)) particles_repa
particle_slice = slice particle (Z :. (0::Int) :. All)
particle_extended = extend (Z :. num_coords :. All) particle_slice

-- Calculate the squared difference between the (x,y,z) coordinates of the particle and the coordinates of the cuboid.
squared_diff = deepSeqArrays [rcoords,particle_extended] ((force2 rcoords) -^ (force2 particle_extended)) **^ 2
(**^) arr pow = R.map (**pow) arr

xslice = slice squared_diff (Z :. All :. (0::Int))
yslice = slice squared_diff (Z :. All :. (1::Int))
zslice = slice squared_diff (Z :. All :. (2::Int))

-- Calculate the distance between each coordinate and the particle center
sum_squared_diff = [xslice,yslice,zslice] `deepSeqArrays` xslice +^ yslice +^ zslice

-- Do the rest using vector, since I didn't get the repa variant working.
ssd_vec = toVector sum_squared_diff

-- Determine the number of the coordinates that are within the particle (instead of taking the square root to get the distances above, I compare to the square of the radius here, to improve performance)
total_within = fromIntegral (V.length $ V.filter (<a**2) ssd_vec)
--total_within = foldAll (\x acc -> if x < a**2 then acc+1 else acc) 0 sum_squared_diff

-- Finally, calculate an approximation of the volume of the sphere by taking the volume of the cubes with side step, multiplied with the number of coordinates within the sphere.
volumeInside_repa = step**3 * total_within 

-- Helper function that shows the size of a 2-D array.
rsize = reverse . listOfShape . (extent :: Array DIM2 Double -> DIM2)

---- End repa code ----

-- Comment out the list or the repa version if you want to time the calculations separately.
main = do
    putStrLn $ "Step = " P.++ show step
    putStrLn $ "Volume of individual particles = " P.++ show volumeIndividuals
    putStrLn $ "Volume of cubes inside particles (list) = " P.++ show volumeInside
    putStrLn $ "Volume of cubes inside particles (repa) = " P.++ show volumeInside_repa

Edit: Some background that explains why I have written the code as it is above:

I mostly write code in Matlab, and my experience of performance improvement comes mostly from that area. In Matlab, you usually want to make your calculations using functions operating on matrices directly, to improve performance. My implementation of the problem above, in Matlab R2010b, takes 0.9 seconds using the matrix version shown below, and 15 seconds using nested loops. Although I know Haskell is very different from Matlab, my hope was that going from using lists to using Repa arrays in Haskell would improve the performance of the code. The conversions from lists->Repa arrays->vectors are there because I'm not skilled enough to replace them with something better. This is why I ask for input. :) The timing numbers above are thus subjective, since they may measure my own skill more than the abilities of the languages, but they are a valid metric for me right now, since what I end up using depends on whether I can make it work or not.

tl;dr: I understand that my Repa code above may be stupid or pathological, but it's the best I can do right now. I would love to be able to write better Haskell code, and I hope that you can help me in that direction (dons already did). :)

function archimedes_simple()

particles = [0 0 0]';
a = 4;

step = 0.1;

xrange = [-10:step:10];
yrange = [-10:step:10];
zrange = [-10:step:10];

[X,Y,Z] = ndgrid(xrange,yrange,zrange);
dists2 = bsxfun(@minus,X,particles(1)).^2+ ...
    bsxfun(@minus,Y,particles(2)).^2+ ...
    bsxfun(@minus,Z,particles(3)).^2;
inside = dists2 < a^2;
num_inside = sum(inside(:));

disp('');
disp(['Step = ' num2str(step)]);
disp(['Volume of individual particles = ' num2str(size(particles,2)*4*pi*a^3/3)]);
disp(['Volume of cubes inside particles = ' num2str(step^3*num_inside)]);

end

Edit 2: New, faster and simpler version of the Repa code

I have now read up a bit more on Repa, and thought a bit. Below is a new Repa version. In this case, I create the x, y, and z coordinates as 3-D arrays, using the Repa extend function, from a list of values (similar to how ndgrid works in Matlab). I then map over these arrays to calculate the squared distance to the center of the spherical particle. Finally, I fold over the resulting 3-D distance array, count how many coordinates are within the sphere, and then multiply the count by a constant factor to get the approximate volume. My implementation of the algorithm is now much more similar to the Matlab version above, and there are no longer any conversions to vectors.
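As a point of comparison with the ndgrid-style construction described above, the coordinate at each grid position can also be computed from its index instead of being stored in three separate arrays. The sketch below uses only base and plain lists, not Repa; `unflatten`, `coordAt` and `countInsideIx` are hypothetical names, and the row-major index layout is an assumption made for the illustration.

```haskell
-- Hypothetical names; base only. Assumes a cubic n*n*n grid in row-major order.

-- | Split a row-major linear index into (i, j, k).
unflatten :: Int -> Int -> (Int, Int, Int)
unflatten n ix = (i, j, k)
  where
    (i, rest) = ix `divMod` (n * n)
    (j, k)    = rest `divMod` n

-- | Coordinate of the grid point with the given linear index.
coordAt :: Double -> Double -> Int -> Int -> (Double, Double, Double)
coordAt lo step n ix = (at i, at j, at k)
  where
    (i, j, k) = unflatten n ix
    at m = lo + step * fromIntegral m

-- | Count grid points inside the sphere of radius r centred at (xp, yp, zp),
-- generating each coordinate on demand from its index.
countInsideIx :: Double -> Double -> Int -> Double -> (Double, Double, Double) -> Int
countInsideIx lo step n r (xp, yp, zp) = length (filter inside [0 .. n * n * n - 1])
  where
    inside ix =
      let (x, y, z) = coordAt lo step n ix
      in sq (x - xp) + sq (y - yp) + sq (z - zp) < sq r
    sq v = v * v
```

The point of the design is that no x, y or z array is ever materialised; only the final count is kept.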

The new version runs in about 5 seconds on my computer, a considerable improvement from above. The timing is the same if I use "-threaded" while compiling, whether or not it is combined with "+RTS -N2", but the threaded version does max out both cores of my computer. I did, however, see a few "-N2" runs drop to 3.1 seconds, but couldn't reproduce them later. Maybe it is very sensitive to other processes running at the same time? I shut down most programs on my computer when benchmarking, but there are still some programs running, such as background processes.

If we use "-N2" and add the runtime switch to turn off parallel GC (-qg), the time consistently goes down to ~4.1 seconds, and using -qa to "use the OS to set thread affinity (experimental)", the time was shaved down to ~3.5 seconds. Looking at the output from running the program with "+RTS -s", much less GC is performed using -qg.

This afternoon I will see if I can run the code on an 8-core computer, just for fun. :)

import Data.Array.Repa as R
import Prelude as P
import qualified Data.List as L

-- Calculate the volume of a spherical particle by putting it in a bath of coordinates. Generate coordinates (x,y,z) in a cuboid. Then, for each coordinate, check if it is inside the sphere. Sum those coordinates and multiply by the coordinate grid step size to find an approximate volume.

particles :: [(Double,Double,Double)]
particles = [(0,0,0)]

-- Radius of the spherical particle
a = 4

volume_individuals = fromIntegral (length particles) * 4*pi*a^3/3

-- Generate the coordinates. 
step = 0.1
coords_list = [-10,-10+step..10] :: [Double]
num_coords = (length coords_list) :: Int

coords :: Array DIM1 Double
coords = fromList (Z :. (num_coords ::Int)) coords_list

coords_slice :: Array DIM1 Double
coords_slice = slice coords (Z :. All)

-- x, y and z are 3-D arrays, where the same index into each array can be used to find a single coordinate, e.g. (x(i,j,k),y(i,j,k),z(i,j,k)).
x,y,z :: Array DIM3 Double
x = extend (Z :. All :. num_coords :. num_coords) coords_slice
y = extend (Z :. num_coords :. All :. num_coords) coords_slice
z = extend (Z :. num_coords :. num_coords :. All) coords_slice

-- Calculate the squared distance from each coordinate to the center of the spherical particle.
dist2 :: (Double, Double, Double) -> Array DIM3 Double
dist2 particle = ((R.map (squared_diff xp) x) + (R.map (squared_diff yp) y) + (R.map (squared_diff zp) z))
    where
        (xp,yp,zp) = particle
        squared_diff xi xa = (xa-xi)^2

-- Count how many of the coordinates are within the spherical particle.
num_inside_particle :: (Double,Double,Double) -> Double
num_inside_particle particle = foldAll (\acc x -> if x<a^2 then acc+1 else acc) 0 (force $ dist2 particle)

-- Calculate the approximate volume covered by the spherical particle.
volume_inside :: [Double]
volume_inside = P.map ((*step^3) . num_inside_particle) particles

main = do
    putStrLn $ "Step = " P.++ show step
    putStrLn $ "Volume of individual particles = " P.++ show volume_individuals
    putStrLn $ "Volume of cubes inside each particle (repa) = " P.++ (P.concat . (L.intersperse ", ") . P.map show) volume_inside

-- As an alternative, y and z could be generated from x, but this was slightly slower in my tests (~0.4 s).
--y = permute_dims_3D x
--z = permute_dims_3D y

-- Permute the dimensions in a 3-D array, (forward, cyclically)
permute_dims_3D a = backpermute (swap e) swap a
    where
        e = extent a
        swap (Z :. i:. j :. k) = Z :. k :. i :. j

Space profiling for the new code

The same types of profiles as Don Stewart made below, but for the new Repa code.

西瑶 2024-11-12 15:45:06

Code Review Notes

47.8% of your time is spent in GC.
1.5G is allocated on the heap (!)
The repa code looks a lot more complicated than the list code.
Lots of parallel GC is occurring
I can get up to 300% efficiency on a -N4 machine
Putting in more type signatures will make it easier to analyze...
rsize isn't used (looks expensive!)
You convert repa arrays to vectors, why?
All your uses of (**) could be replaced by the cheaper (^), which takes an Int exponent.
There's a suspicious number of large, constant lists. Those all have to be converted to arrays -- that seems expensive.
any (==True) is the same as or
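Two of the notes above (using `(^)` instead of `(**)`, and dropping the `(==True)` comparisons) could be applied to the list code roughly as follows. This is a sketch under assumptions, not the reviewer's code: the radius is passed as an explicit argument `a` instead of being a top-level constant, and type signatures are added as the notes suggest.

```haskell
type Point = (Double, Double, Double)

-- | Squared-distance test using (^) with an Int exponent instead of (**).
insideParticle :: Double -> Point -> Point -> Bool
insideParticle a (xc, yc, zc) (xp, yp, zp) =
  (xc - xp) ^ two + (yc - yp) ^ two + (zc - zp) ^ two < a ^ two
  where
    two :: Int
    two = 2

-- | `any p xs` replaces `any (==True) (map p xs)`, i.e. `or (map p xs)`.
insideParticles :: Double -> [Point] -> Point -> Bool
insideParticles a particles coord = any (insideParticle a coord) particles

-- | `filter p` replaces `filter (==True) . map p` when only the count matters.
numberInsideParticles :: Double -> [Point] -> [Point] -> Int
numberInsideParticles a particles = length . filter (insideParticles a particles)
```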

Time profiling

COST CENTRE                    MODULE               %time %alloc

squared_diff                   Main                  25.0   27.3
insideParticle                 Main                  13.8   15.3
sum_squared_diff               Main                   9.8    5.6
rcoords                        Main                   7.4    5.6
particle_extended              Main                   6.8    9.0
particle_slice                 Main                   5.0    7.6
insideParticles                Main                   5.0    4.4
yslice                         Main                   3.6    3.0
xslice                         Main                   3.0    3.0
ssd_vec                        Main                   2.8    2.1
**^                            Main                   2.6    1.4

shows that your function squared_diff is a bit suspicious:

squared_diff :: Array DIM2 Double
squared_diff = deepSeqArrays [rcoords,particle_extended]
                    ((force2 rcoords) -^ (force2 particle_extended)) **^ 2

though I don't see any obvious fix.

Space profiling

Nothing too amazing in the space profile: you clearly see the list phase, then the vector phase. The list phase allocates a lot, which gets reclaimed.

[Figure: space (heap) profile]

Breaking down the heap by type, we see initially a lot of lists and tuples being allocated (on demand), then a big chunk of arrays are allocated and held:

[Figure: heap profile broken down by type]

Again, kinda what we expected to see... the array stuff isn't allocating especially more than the list code (in fact, a bit less overall), but it is just taking a lot longer to run.

Checking for space leaks with retainer profiling:

[Figure: retainer profile]

There's a few interesting things there, but nothing startling. zcoords gets retained for the length of the list program execution, then some arrays (SYSTEM) are being allocated for the repa run.

Inspecting the Core

So at this point I'm firstly assuming that you really did implement the same algorithms in lists and arrays (i.e. no extra work is being done in the array case), and there's no obvious space leak. So my suspicion is badly-optimized repa code. Let's look at the core (with ghc-core).

The list-based code looks fine.
The array code looks reasonable (i.e. unboxed primitives appear), but very complex, and a lot of it.

Inlining all the CAFs

I added inline pragmas to all the top level array definitions, in the hope of removing some of the CAFs and getting GHC to optimize the array code a bit harder. This really made GHC struggle to compile the module (allocating up to 4.3G and taking 10 minutes while working on it). This is a clue to me that GHC wasn't able to optimize this program well before, since there's new stuff for it to do when I increase the thresholds.

Actions

Using -H can decrease the time spent in GC.
Try to eliminate the conversions from lists to repas to vectors.
All those CAFs (top level constant data structures) are kinda weird -- a real program wouldn't be a list of top level constants -- in fact, this module is pathologically so, causing lots of values to be retained over long periods, instead of being optimized away. Float local definitions inwards.
Ask for help from Ben Lippmeier, the author of Repa, particularly since there's some funky optimization stuff happening.
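One way to read the advice about CAFs and floating definitions inwards is to make the whole computation a function of its parameters, with no top-level lists or arrays at all. The following is a minimal base-only sketch of that idea, not a Repa solution; `countInsideLoop` and its argument order are hypothetical.

```haskell
-- Hypothetical function; base only. Nothing here is a top-level constant
-- data structure: the grid is walked with strict Int counters instead.

-- | Count the points of an n*n*n grid (starting at lo, spacing step) that lie
-- inside the sphere of radius r centred at (xp, yp, zp).
countInsideLoop :: Int -> Double -> Double -> Double -> (Double, Double, Double) -> Int
countInsideLoop n lo step r (xp, yp, zp) = goX 0 0
  where
    r2 = r * r

    at :: Int -> Double
    at i = lo + step * fromIntegral i

    -- Outer loop over x; the accumulator is forced at every step.
    goX i acc
      | i >= n    = acc
      | otherwise = let dx = at i - xp
                    in goX (i + 1) $! goY (dx * dx) 0 acc

    -- Middle loop over y, carrying the squared x distance.
    goY dx2 j acc
      | j >= n    = acc
      | otherwise = let dy = at j - yp
                    in goY dx2 (j + 1) $! goZ (dx2 + dy * dy) 0 acc

    -- Inner loop over z, carrying the squared x-y distance.
    goZ dxy2 k acc
      | k >= n    = acc
      | otherwise = let dz   = at k - zp
                        acc' = if dxy2 + dz * dz < r2 then acc + 1 else acc
                    in goZ dxy2 (k + 1) $! acc'
```

For the grid in the question the call would be `countInsideLoop 201 (-10) 0.1 4 (0,0,0)`; no timing is claimed for it here.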


迷乱花海 2024-11-12 15:45:06

I changed the code to force rcoords and particle_extended, and discovered we were losing the lion's share of time within them directly:

COST CENTRE                    MODULE               %time %alloc

rcoords                        Main                  32.6   34.4
particle_extended              Main                  21.5   27.2
**^                            Main                   9.8   12.7

The biggest single improvement to this code would clearly be to generate those two constant inputs in a better fashion.

Note that this is basically a lazy, streaming algorithm, and where you're losing time is the sunk cost of allocating at least two 24361803-element arrays all in one go, and then probably allocating at least once or twice more or giving up sharing. The very best case for this code, I think, with a very good optimizer and a zillion rewrite rules, will be to roughly match the list version (which can also parallelize very easily).
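The element count quoted above follows from the grid in the question: 201 values per axis (from -10 to 10 in steps of 0.1, both ends included), cubed, times three columns for the N x 3 layout. A small worked calculation is sketched below; the 8 bytes per element is an assumption (an unboxed Double), and the names are hypothetical.

```haskell
-- Hypothetical names; plain arithmetic on the sizes from the question.

-- | Values per axis: -10 to 10 in steps of 0.1, both ends included.
axisPoints :: Int
axisPoints = 201

-- | Number of (x, y, z) grid points.
gridPoints :: Int
gridPoints = axisPoints ^ (3 :: Int)   -- 8120601

-- | Elements in one N x 3 array such as rcoords or particle_extended.
elementsNx3 :: Int
elementsNx3 = 3 * gridPoints           -- 24361803

-- | Payload size of one such array, ASSUMING 8 bytes per unboxed Double.
bytesNx3 :: Int
bytesNx3 = 8 * elementsNx3             -- 194894424, roughly 186 MiB
```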

I think dons is right that Ben & co. will be interested in this benchmark, but my overwhelming suspicion is that this is not a good use case for a strict array library, and I suspect that Matlab is hiding some clever optimizations behind its ndgrid function (optimizations, I'll grant, which it might be useful to port to Repa).

Edit:

Here's a quick and dirty way to parallelize the list code. Import Control.Parallel.Strategies and then write numberInsideParticles as:

numberInsideParticles particles coords = length $ filter id $ 
    withStrategy (parListChunk 2000 rseq) $ P.map (insideParticles particles) coords

This shows good speedup as we scale up cores (12s at one core to 3.7s at 8), but the overhead of spark creation means that even at 8 cores we only match the single core non-parallel version. I tried a few alternate strategies and got similar results. Again, I'm not sure how much better we can possibly do than a single-threaded list version here. Since the computations on each individual particle are so cheap, we're mainly stressing allocation, not computation. The big win on something like this I imagine would be vectorized computation more than anything else, and as far as I know that pretty much requires hand-coding.
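Since spark overhead is the issue named above, a coarser-grained split is one thing that could be tried: a handful of large chunks instead of many small ones. The sketch below is a different technique from the Strategies code in this answer. It uses only base (`forkIO` and `MVar`), splits the x axis into slabs, and counts each slab in its own thread. All names are hypothetical and no speedup is claimed; real parallelism would also need `-threaded` and `+RTS -N`.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)

type Point = (Double, Double, Double)

-- | Split a list into pieces of at most k elements (k is clamped to >= 1).
chunk :: Int -> [a] -> [[a]]
chunk _ [] = []
chunk k xs = let (h, t) = splitAt (max 1 k) xs in h : chunk k t

-- | Count the grid points inside the sphere for one slab of x values.
countSlab :: Double -> Point -> [Double] -> [Double] -> Int
countSlab r (xp, yp, zp) ax xs =
  length [() | x <- xs, y <- ax, z <- ax, sq (x - xp) + sq (y - yp) + sq (z - zp) < sq r]
  where
    sq v = v * v

-- | Count in parallel: one thread per slab of `slabSize` x values.
countInsidePar :: Int -> Double -> Point -> [Double] -> IO Int
countInsidePar slabSize r p ax = do
  vars <- mapM spawn (chunk slabSize ax)
  counts <- mapM takeMVar vars
  return (sum counts)
  where
    spawn xs = do
      v <- newEmptyMVar
      -- evaluate forces the count inside the worker thread, not in the caller.
      _ <- forkIO (evaluate (countSlab r p ax xs) >>= putMVar v)
      return v
```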

Also note that the parallel version spends roughly 70% of its time in GC, while the one-core version spends 1% of its time there (i.e. the allocation is, to the extent possible, effectively fused away).
