MATLAB Parallel Computing Toolbox - Parallelization vs. GPU?
I'm working with someone who has some MATLAB code that they want sped up. They are currently trying to convert all of this code into CUDA so it runs on the GPU. I think it would be faster to use MATLAB's Parallel Computing Toolbox to speed this up, and to run it on a cluster that has MATLAB's Distributed Computing Toolbox, allowing me to run the code across several different worker nodes. Now, as part of the Parallel Computing Toolbox, you can use things like gpuArray. However, I'm confused about how this would work. Are things like parfor (parallelization) and gpuArray (GPU programming) compatible with each other? Can I use both? Can something be split across different worker nodes (parallelization) while also making use of whatever GPUs are available on each worker?
They think it's still worth exploring the time it would take to convert all of the MATLAB code to CUDA code to run on a machine with multiple GPUs...but I think the right approach is to use the features already built into MATLAB.
Any help, advice, direction would be really appreciated!
Thanks!
When you use parfor, you are effectively dividing your for loop into tasks, with one task per loop iteration, and splitting up those tasks to be computed in parallel by several workers where each worker can be thought of as a MATLAB session without an interactive GUI. You configure your cluster to run a specified number of workers on each node of the cluster (generally, you would choose to run a number of workers equal to the number of available processor cores on that node).
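A minimal sketch of that pattern might look like the following; 'local' stands in for a configured cluster profile, and the loop body is just placeholder work, not anything from the original code:

    N = 1000;
    results = zeros(1, N);
    parpool('local');                  % or the name of a configured cluster profile
    parfor k = 1:N
        % each iteration is an independent task handed to one of the workers;
        % the sum below is placeholder work standing in for the real computation
        results(k) = sum(sin((1:1e5) * k));
    end
    delete(gcp('nocreate'));           % shut the pool down when finished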
On the other hand, gpuArray indicates to MATLAB that you want to make a matrix available for processing by the GPU. Under the hood, MATLAB marshals the data from main memory to the graphics board's internal memory. Certain MATLAB functions (there's a list of them in the documentation) can operate on gpuArrays, and the computation happens on the GPU.
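A minimal sketch of that workflow (fft is one of the documented gpuArray-enabled functions) could be:

    A = rand(4096, 'single');      % data starts in main (host) memory
    G = gpuArray(A);               % copy it to the graphics board's memory
    F = fft(G);                    % fft is overloaded for gpuArrays, so this runs on the GPU
    result = gather(F);            % copy the result back to main memory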
The key difference between the two techniques is that parfor computations happen on the CPUs of the cluster's nodes, with direct access to main memory. CPU cores typically have a higher clock rate, but there are typically fewer of them in a CPU cluster than there are GPU cores. Individually, GPU cores are slower than a typical CPU core, and their use requires that data be transferred from main memory to video memory and back again, but there are many more of them in a cluster. As far as I know, hybrid approaches are supposed to be possible, in which you have a cluster of PCs, each PC has one or more Nvidia Tesla boards, and you use both parfor loops and gpuArrays. However, I haven't had occasion to try this yet.
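A hedged sketch of what that hybrid setup could look like, assuming one pool worker per GPU and that MATLAB assigns each worker its own device (worth verifying against the Parallel Computing Toolbox documentation for your release):

    numTrials = 8;
    out = zeros(1, numTrials);
    parpool('local', gpuDeviceCount);      % one worker per GPU on this machine (assumption)
    parfor k = 1:numTrials
        % each worker operates on the GPU it has been assigned;
        % building the data directly on the device avoids a host-to-device copy
        X = rand(2048, 'gpuArray');
        Y = X * X';                        % the matrix multiply runs on that worker's GPU
        out(k) = gather(max(Y(:)));        % only a scalar comes back to the client
    end
    delete(gcp('nocreate'));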
If you are mainly interested in simulations, GPU processing is the perfect choice. However, if you want to analyse (big) data, go with parallelization. The reason is that GPU processing is only faster than CPU processing if you don't have to copy data back and forth. In the case of a simulation, you can generate most of the data on the GPU and only need to copy the result back. If you try to work with larger data on the GPU, you will very often run into out-of-memory problems.
Parallelization is great if you have large data structures and more than two cores in your computer's CPU.
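As a rough illustration of that point, a Monte Carlo estimate of pi is the kind of simulation where the samples can live entirely in GPU memory and only a single scalar is copied back:

    n = 1e7;
    % generate the random samples directly in GPU memory, so nothing is copied up-front
    x = rand(n, 1, 'gpuArray');
    y = rand(n, 1, 'gpuArray');
    inside = (x.^2 + y.^2) <= 1;               % element-wise test stays on the GPU
    piEstimate = gather(4 * sum(inside) / n);  % only one scalar returns to main memory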
If you write it in CUDA, it is guaranteed to run in parallel at the chip level, versus relying on MATLAB's best guess for a non-parallel architecture and your best effort to get it to run in parallel.
Kind of like drinking fresh mountain water run-off versus buying filtered water. Go with the purist solution.