Optimizing a large model - attempting to exploit parallelism
I have a chunk of code I've been re-writing over the past week or so to get it running as quickly as possible.
The code is modeling a diffracted laser beam and its essence is a convolution of a 640*640 kernel over many 2D 1280*1280 slices - each slice being a new position along the beam axis.
Stage one of optimizing was Compiling my functions and stage two was learning that Mathematica likes to operate with large lists of data - so passing it a 3D space of many layers at once as opposed to slices one after another.
However this ate my RAM!
Here is my current setup:
Func2[K_, ZRange_] :=
Module[{layers = Dimensions[ZRange][[1]]},
x = ConstantArray[Table[x, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}], {layers}];
y = ConstantArray[Table[y, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}], {layers}];
z = Table[ConstantArray[z, {1281, 1281}], {z, ZRange}];
UTC = Func3[x, y, z];
Abs[ListConvolve[K, #] & /@ UTC]
]
Func3 = Compile[{{x, _Real}, {y, _Real}, {z, _Real}},
Module[{Sr2R2 = Sqrt[x^2 + y^2 + z^2]},
0.5 (1. + z/Sr2R2) Exp[2 \[Pi] I (Sr2R2 - z)]/Sr2R2],
RuntimeAttributes -> {Listable},
CompilationTarget -> "C"
];
ZRangeList = {{20., 19., 18., 17., 16., 15., 14., 13., 12., 11.},
{10., 9., 8., 7., 6., 5., 4., 3., 2., 1.}};
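(* kernel is the 640*640 convolution kernel, defined elsewhere *)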
results = Table[Func2[kernel, ZList], {ZList, ZRangeList}];
Some explanations:
- The work is split into two functions as I want to be able to compile as much as possible.
- The Z values are split into a list of lists to make the functions evaluate several layers at once.
Some Questions:
- How would you make this faster?
- When run as is, both my cores are used, but by one Mathematica kernel. If I run it with ParallelTable, it runs multiple kernels but eats more RAM and is ultimately slower.
- I would like to be able to run it on as many cores as possible - I have a LightweightGrid running - how can I do this?
- Why can't I pass a Compiled function lists of different dimensions?
Answers (2)
The thing that instantly jumps out at me is that
Abs[ListConvolve[K, #] & /@ UTC]
could be made into
ParallelMap[Abs@ListConvolve[K, #] &, UTC]
However, I'm really surprised that ParallelTable is slower than plain Table, since that's only the case in two situations: it's more expensive to parallelize than to perform the task, or parallelization requires too much communication between sub-kernels.
Have you distributed your definitions when you parallelized? E.g. for the above, you'd LaunchKernels first, before you even start, and then use DistributeDefinitions on K (UTC doesn't need to be distributed, since it's not actually used in the sub-kernels; rather, its parts are). See if you can make use of Share[] as well to reduce the memory load.
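Concretely, the launch-then-distribute pattern might look like the following sketch (using K and UTC from the question; Share[] is optional, and how much it helps depends on how much of your data is duplicated):

LaunchKernels[];              (* start the sub-kernels once, up front *)
DistributeDefinitions[K];     (* push the kernel matrix to every sub-kernel *)
Share[];                      (* optionally share identical subexpressions to cut memory *)
result = ParallelMap[Abs@ListConvolve[K, #] &, UTC];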
Have you thought about doing this with CUDA? Seems perfect for the simple numeric maths that you're doing inside the functions.
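If you want to experiment, a hypothetical sketch with CUDALink might look like this. It assumes a real-valued kernel K (so the complex field can be split into real and imaginary parts, since convolution is linear) and that CUDAImageConvolve accepts plain numeric matrices; note that its padding conventions differ from ListConvolve, so the output size needs checking:

Needs["CUDALink`"]
(* hypothetical sketch: convolve each layer on the GPU;
   assumes K is real-valued, so Re and Im parts can be convolved separately *)
gpuConvolve[layer_] :=
  CUDAImageConvolve[Re[layer], K] + I CUDAImageConvolve[Im[layer], K];
gpuResult = Abs[gpuConvolve /@ UTC];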
Also notice that you're constantly re-creating this table: Table[x, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}]. Why not make it a variable, and create a ConstantArray of that variable's value?
You're wasting about 0.2 seconds on every one of those.
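A sketch of that hoisting, keeping the rest of Func2 as in the question (xPlane and yPlane are names I've made up; since the x and y ranges are identical, one plane is just the transpose of the other):

xPlane = Table[x, {x, -80, 80, 0.125}, {y, -80, 80, 0.125}];  (* built once, outside Func2 *)
yPlane = Transpose[xPlane];   (* identical ranges, so the y grid is the transpose of the x grid *)
Func2[K_, ZRange_] :=
 Module[{layers = Length[ZRange]},
  x = ConstantArray[xPlane, {layers}];
  y = ConstantArray[yPlane, {layers}];
  z = Table[ConstantArray[z, {1281, 1281}], {z, ZRange}];
  Abs[ListConvolve[K, #] & /@ Func3[x, y, z]]
  ]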
Lastly, a tiny little quirk: division is always a terrible thing to do when you're trying to optimize - it's time consuming. The compiled body

0.5 (1. + z/Sr2R2) Exp[2 \[Pi] I (Sr2R2 - z)]/Sr2R2

can be made a hair better as (feel free to check my math):
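The rewritten expression was lost when this answer was mirrored; one way to get the intended effect (my reconstruction, not necessarily what the answerer wrote) is to divide once and multiply twice:

Func3 = Compile[{{x, _Real}, {y, _Real}, {z, _Real}},
   Module[{Sr2R2 = Sqrt[x^2 + y^2 + z^2], inv = 0.},
    inv = 1./Sr2R2;  (* single division, reused below *)
    0.5 (1. + z inv) Exp[2 \[Pi] I (Sr2R2 - z)] inv],
   RuntimeAttributes -> {Listable},
   CompilationTarget -> "C"
   ];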
Neither parallelization nor even C compilation (using gcc 4.7 from equation.com, augmented by VC++ Express, on 64-bit Windows) improves the timings.
Running this code takes about 6.5 seconds:
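The timed code block was lost when this answer was mirrored; presumably it ran the two-function setup from the question, along these lines (my reconstruction):

AbsoluteTiming[
 results = Table[Func2[kernel, ZList], {ZList, ZRangeList}];
]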
and compiling everything into one function is slower (8.1 sec):
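Again the code was lost in mirroring; a plausible shape of "everything in one function" (my reconstruction, with FuncOne a name I've made up) is to generate each layer entirely inside a single compiled function and convolve outside:

FuncOne = Compile[{{z, _Real}},
   Table[
    With[{s = Sqrt[x^2 + y^2 + z^2]},
     0.5 (1. + z/s) Exp[2 \[Pi] I (s - z)]/s],
    {x, -80., 80., 0.125}, {y, -80., 80., 0.125}],
   CompilationTarget -> "C"];
results = Table[Abs[ListConvolve[kernel, FuncOne[zv]]], {zv, Flatten[ZRangeList]}];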
It is usually not so easy to figure out when ParallelTable and friends really help.
It just depends on the problem, the size, the Mathematica version, etc.