Matlab and GPU/CUDA programming
I need to run several independent analyses on the same data set.
Specifically, I need to run batches of 100 GLM (generalized linear model) analyses, and I was thinking of taking advantage of my video card (GTX580).
As I have access to Matlab and the Parallel Computing Toolbox (and I'm not good with C++), I decided to give it a try.
I understand that a single GLM is not ideal for parallel computing, but as I need to run 100-200 in parallel, I thought that using parfor could be a solution.
My problem is that it is not clear to me which approach I should follow. I wrote a gpuArray version of the Matlab function glmfit, but using parfor offers no advantage over a standard "for" loop.
Does this have anything to do with the matlabpool setting? It is not even clear to me how to set it to "see" the GPU card. If I'm not mistaken, by default it is set to the number of cores in the CPU (4 in my case).
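For reference, the pattern I tried looks roughly like this (glmfit_gpu stands in for my own gpuArray-based rewrite of glmfit; it is not a built-in function):

```matlab
% Sketch of my current approach; glmfit_gpu is my own gpuArray
% rewrite of glmfit, shown here only to illustrate the loop structure.
matlabpool open 4            % defaults to the number of CPU cores
results = cell(100, 1);
parfor i = 1:100
    % each iteration fits one independent GLM on the shared data set
    results{i} = glmfit_gpu(X(:, :, i), y(:, i), 'binomial');
end
matlabpool close
```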
Am I completely wrong on the approach?
Any suggestion would be highly appreciated.
Edit
Thanks. I'm aware of GPUmat and Jacket, and I could start writing in C without too much effort, but I'm testing the GPU computing possibilities for a department where everybody uses Matlab or R. The final goal would be a cluster based on C2050 cards and the Matlab Distributed Computing Server (or at least that was the first project).
Reading the ads from Mathworks, I was under the impression that parallel computing was possible even without C skills. It is impossible to ask the researchers in my department to learn C, so I'm guessing that GPUmat and Jacket are the better solutions, even though their limitations are quite big and support for several commonly used routines like glm is non-existent.
How can they be interfaced with a cluster? Do they work with some job distribution system?
I would recommend you try either GPUmat (free) or AccelerEyes Jacket (paid, but with a free trial) rather than the Parallel Computing Toolbox. The toolbox doesn't have as much functionality.
To get the most performance, you may want to learn some C (no need for C++) and code in raw CUDA yourself. Many of these high-level tools may not be smart enough about how they manage memory transfers (you could lose all your computational benefit by needlessly shuffling data across the PCI-E bus).
Parfor will help you utilize multiple GPUs, but not a single GPU. The thing is that a single GPU can only do one thing at a time, so parfor on a single GPU and for on a single GPU achieve exactly the same effect (as you are seeing).
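A minimal sketch of the multi-GPU case, assuming one pool worker per GPU (gpuDevice, spmd, and labindex are all in the Parallel Computing Toolbox; the sum-of-squares is just a placeholder computation):

```matlab
% One GPU per pool worker: bind a device per worker inside spmd,
% then parfor distributes iterations across the workers/GPUs.
matlabpool open 2                 % one worker per GPU
spmd
    gpuDevice(labindex);          % worker k claims GPU k
end
out = zeros(100, 1);
parfor i = 1:100
    d = gpuArray(data(:, i));     % data is your host-side matrix
    out(i) = gather(sum(d .^ 2)); % placeholder GPU computation
end
matlabpool close
```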
Jacket tends to be more efficient because it can combine multiple operations and run them more efficiently, and it has more features; however, most departments already have the Parallel Computing Toolbox and not Jacket, so that can be an issue. You can try the demo to check.
I have no experience with GPUmat.
The Parallel Computing Toolbox is getting better; what you need is some large matrix operations. GPUs are good at doing the same thing many times over, so you need to either combine your code somehow into one operation or make each operation big enough. We are talking about a need for ~10000 things in parallel at least, though not as a set of 1e4 small matrices but rather as one large matrix with at least 1e4 elements.
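For example, rather than looping over 100 small design matrices, stack them into one large gpuArray so that a single operation covers all of them (the sizes here are purely illustrative):

```matlab
% 100 problems of 100 observations each, stacked into one matrix:
% one operation over 1e4+ elements keeps the GPU busy, whereas a
% loop over 100 tiny arrays would not.
Xbig = gpuArray(rand(100 * 100, 20));  % stacked design matrices
eta  = Xbig * gpuArray(rand(20, 1));   % one big matrix multiply
mu   = 1 ./ (1 + exp(-eta));           % elementwise on the GPU
result = gather(mu);                   % bring the result back to host
```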
I do find that with the Parallel Computing Toolbox you still need quite a bit of inline CUDA code to be effective (it's still pretty limited). It does, however, let you inline kernels and transform Matlab code into kernels.
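For instance, arrayfun on gpuArray inputs compiles an elementwise MATLAB function into a single GPU kernel (a minimal sketch):

```matlab
% arrayfun with gpuArray inputs JIT-compiles the elementwise
% function into one GPU kernel instead of many small operations.
x = gpuArray(rand(1e6, 1));
logistic = @(v) 1 ./ (1 + exp(-v));
y = arrayfun(logistic, x);   % runs as a single kernel on the GPU
y = gather(y);               % copy the result back to the host
```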