pyCUDA vs. C performance difference?
I'm new to CUDA programming and I was wondering how the performance of pyCUDA compares to programs implemented in plain C.
Will the performance be roughly the same? Are there any bottlenecks that I should be aware of?
EDIT:
I obviously tried to google this issue first, and was surprised to find no information. I would have expected the pyCUDA people to have answered this question in their FAQ.
5 Answers
If you're using CUDA -- whether directly through C or with pyCUDA -- all the heavy numerical work you're doing is done in kernels that execute on the GPU and are written in CUDA C (directly by you, or indirectly with elementwise kernels). So there should be no real difference in performance in those parts of your code.
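For illustration, here is a minimal sketch (not part of the original answer) of the two kernel styles mentioned above; in both cases the code doing the work is compiled as CUDA C, and only the Python-side wrapping differs:

    import numpy as np
    import pycuda.autoinit  # initializes a CUDA context
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule
    from pycuda.elementwise import ElementwiseKernel

    # Hand-written CUDA C, wrapped with SourceModule.
    mod = SourceModule("""
    __global__ void scale(float *x, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] *= a;
    }
    """)
    scale = mod.get_function("scale")

    # The same operation as an elementwise kernel: pyCUDA generates the
    # indexing boilerplate, but the body is still compiled as CUDA C.
    scale_ew = ElementwiseKernel("float *x, float a", "x[i] *= a", "scale_ew")

    x = gpuarray.to_gpu(np.arange(1024, dtype=np.float32))
    scale(x.gpudata, np.float32(2.0), np.int32(x.size),
          block=(256, 1, 1), grid=(x.size // 256, 1))
    scale_ew(x, np.float32(0.5))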
Now, the initialization of arrays, and any post-work analysis, will be done in Python (probably with numpy) if you use pyCUDA, and that generally will be significantly slower than doing it directly in a compiled language (though if you've built your numpy/scipy in such a way that it links directly to high-performance libraries, those calls would at least perform the same in either language). But hopefully your initialization and finalization are small fractions of the total amount of work you have to do, so even if there is significant overhead there, it won't have a huge impact on overall runtime.
And in fact, if it turns out that the Python parts of the computation do hurt your application's performance, doing your development in pyCUDA may still be an excellent way to get started, as development is significantly easier, and you can always re-implement in straight C those parts of the code that are too slow in Python, call them from Python, and gain some of the best of both worlds.
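As a hypothetical sketch of that last step (the library and function names here are made up for illustration), a C routine compiled into a shared library can be called from Python with ctypes:

    import ctypes
    import numpy as np

    # Assumes something like:
    #   cc -O3 -shared -fPIC postprocess.c -o libpostprocess.so
    # exposing: void postprocess(const double *data, size_t n, double *out);
    lib = ctypes.CDLL("./libpostprocess.so")
    lib.postprocess.argtypes = [ctypes.POINTER(ctypes.c_double),
                                ctypes.c_size_t,
                                ctypes.POINTER(ctypes.c_double)]
    lib.postprocess.restype = None

    data = np.random.rand(1_000_000)
    out = np.empty_like(data)
    lib.postprocess(data.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
                    data.size,
                    out.ctypes.data_as(ctypes.POINTER(ctypes.c_double)))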
If you're wondering about performance differences by using pyCUDA in different ways, see SimpleSpeedTest.py included in the pyCUDA Wiki examples. It benchmarks the same task completed by a CUDA C kernel encapsulated in pyCUDA, and by several abstractions created by pyCUDA's designer. There's a performance difference.
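If you'd rather measure this on your own hardware, a rough sketch in the same spirit (this is not the contents of SimpleSpeedTest.py) is to time a code path with CUDA events:

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as cuda
    import pycuda.gpuarray as gpuarray

    x = gpuarray.to_gpu(np.random.rand(2**20).astype(np.float32))

    start, end = cuda.Event(), cuda.Event()
    start.record()
    for _ in range(100):
        y = 2.0 * x + 1.0  # the gpuarray abstraction: generated kernels
    end.record()
    end.synchronize()
    print("gpuarray: %.3f ms per iteration" % (start.time_till(end) / 100))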
I've been using pyCUDA for a little while and I like prototyping with it because it speeds up the process of turning an idea into working code.
With pyCUDA you will be writing the CUDA kernels using C++, and it's CUDA, so there shouldn't be a difference in the performance of running that code. But there will be a difference in the performance of the code you write in Python to set up or use the results of the pyCUDA kernel versus the code you write in C.
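To make that concrete, a small sketch of the Python-side glue in question (sizes are illustrative): the transfers and launch bookkeeping run in Python, while the arithmetic itself runs as CUDA C on the GPU.

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray

    host = np.random.rand(2**22).astype(np.float32)

    x = gpuarray.to_gpu(host)  # Python-side setup: host-to-device copy
    y = (x * x).get()          # kernel launch (CUDA C) + device-to-host copy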
I was looking for an answer to the original question in this post, and I see the problem is deeper than I thought.
In my experience, I compared CUDA kernels and cuFFT calls written in C with the same written in pyCUDA. Surprisingly, I found that, on my computer, the performance of summing, multiplying, or computing FFTs varied between implementations. For example, I got almost the same performance in cuFFT for vector sizes up to 2^23 elements. However, summing and multiplying complex vectors showed some trouble: the speedup obtained in C/CUDA was ~6x for N = 2^17, whilst in pyCUDA it was only ~3x. It also depends on the way the summation is performed. By using SourceModule and wrapping the raw CUDA code, I found that my kernel, for complex128 vectors, was limited to a lower N (<=2^16) than the one usable with gpuarray (<=2^24).
In conclusion, it is a good idea to test and compare both sides of the problem, and to evaluate whether it is worth spending the time writing CUDA code in C, or gaining readability and paying the cost of lower performance.
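For reference, a hedged sketch of the two approaches being compared here, adding complex128 vectors through the gpuarray abstraction versus through a hand-wrapped SourceModule kernel (sizes and names are illustrative):

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    n = 2**16
    a = gpuarray.to_gpu(np.random.rand(n).astype(np.complex128))
    b = gpuarray.to_gpu(np.random.rand(n).astype(np.complex128))

    # 1) gpuarray abstraction: pyCUDA generates and launches the kernel.
    c1 = a + b

    # 2) Raw CUDA C wrapped with SourceModule, using pyCUDA's complex type.
    #    no_extern_c=True because pycuda-complex.hpp is a C++ template header;
    #    the kernel itself is marked extern "C" to avoid name mangling.
    mod = SourceModule("""
    #include <pycuda-complex.hpp>
    typedef pycuda::complex<double> cdouble;

    extern "C" __global__ void add(cdouble *a, cdouble *b, cdouble *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }
    """, no_extern_c=True)
    add = mod.get_function("add")

    c2 = gpuarray.empty_like(a)
    add(a.gpudata, b.gpudata, c2.gpudata, np.int32(n),
        block=(256, 1, 1), grid=((n + 255) // 256, 1))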
If you're using PyCUDA and you want high performance, make sure you're using -O3 optimizations and use nvprof/nvvp to profile your kernels. If you want to use CUDA from Python, PyCUDA is probably THE choice, because interfacing C++/CUDA code from Python is just hell otherwise: you have to write a lot of ugly wrappers, and for numpy integration even more hardcore wrapper code would be necessary.
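For instance (a sketch; the flag choice is illustrative, my_script.py is a placeholder, and -O3 here mainly affects the host-side compilation), nvcc options can be forwarded through SourceModule, and the script can then be profiled from the shell:

    import pycuda.autoinit
    from pycuda.compiler import SourceModule

    # The options list is passed straight to nvcc.
    mod = SourceModule("""
    __global__ void noop() { }
    """, options=["-O3", "--use_fast_math"])

    # Then, from the shell:
    #   nvprof python my_script.py   # per-kernel timings on the command line
    #   nvvp                         # the visual profiler GUI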