Best way of using CUDA
There are several ways of using CUDA:
- auto-parallelizing tools such as PGI Workstation;
- wrappers such as Thrust (in STL style);
- the NVIDIA GPU SDK (runtime/driver API).
Which one is better in terms of performance, learning curve, or other factors?
Any suggestions?
Comments (4)
Performance rankings will likely be 3, 2, 1.
The learning curve is (1+2), 3.
If you become a CUDA expert, it will be next to impossible to beat the performance of your hand-rolled code using all the tricks in the book, because of the control the GPU SDK gives you.
That said, a wrapper like Thrust is written by NVIDIA engineers and has been shown on several problems to reach 90-95%+ of the efficiency of hand-rolled CUDA. The reductions, scans, and many cool iterators it provides are also useful for a wide class of problems.
Auto-parallelizing tools tend not to do quite as good a job with the different memory types, as karlphillip mentioned.
My preferred workflow is to write as much as I can with Thrust and then use the GPU SDK for the rest. This is largely a matter of reducing development time and increasing maintainability without trading away too much performance.
Go with the traditional CUDA SDK, for both performance and a smaller learning curve.
CUDA exposes several types of memory (global, shared, texture) which have a dramatic impact on the performance of your application; there are great articles about it on the web.
This page is very interesting and mentions the great series of articles about CUDA on Dr. Dobb's.
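To make the memory-type point concrete, here is a small illustrative sketch (sizes and names are my own) of a block-level sum that stages data in fast on-chip shared memory instead of re-reading global memory — the kind of optimization the automatic tools tend to miss:

```cuda
// Each block sums 256 elements. Loads go global -> shared once;
// the tree reduction then runs entirely in on-chip shared memory.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float tile[256];                   // shared: on-chip, per-block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // one global read per thread
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];                // one global write per block
}
```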
I believe that the NVIDIA GPU SDK is the best, with a few caveats. For example, try to avoid using the cutil.h functions, as these were written solely for use with the SDK; I personally, as well as many others, have run into problems and bugs in them that are hard to fix. (There is also no documentation for this "library", and I've heard that NVIDIA does not support it at all.)
Instead, as you mentioned, use one of the two provided APIs. In particular I recommend the Runtime API, as it is the higher-level API, so you don't have to worry quite as much about all of the low-level implementation details as you do with the Driver API.
Both APIs are fully documented in the CUDA Programming Guide and the CUDA Reference Guide, both of which are updated and shipped with each CUDA release.
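A minimal Runtime API program looks like this — note there is no explicit context or module management (which the Driver API requires via `cuInit`/`cuCtxCreate`/`cuModuleLoad`); the runtime handles that implicitly. The kernel here is a trivial example of my own:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void add_one(int *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1;
}

int main()
{
    const int n = 256;
    int host[n] = {0};

    int *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(int));            // no context juggling needed
    cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);

    add_one<<<1, n>>>(dev, n);                    // <<<grid, block>>> launch syntax
    cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    printf("%d\n", host[0]);
    return 0;
}
```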
It depends on what you want to do on the GPU. If your algorithm would benefit greatly from the things Thrust offers, like reduction and prefix sum, then Thrust is definitely worth a try, and I bet you can't write the code faster yourself in pure CUDA C.
However, if you're porting already-parallel algorithms from the CPU to the GPU, it might be easier to write them in plain CUDA C. I've had successful projects with good speedups going this route, where the CPU/GPU code that does the actual calculations is almost identical.
You can combine the two paradigms to some extent, but as far as I know you're launching a new kernel for each Thrust call; if you want everything in one big fat kernel (keeping too-frequent kernel launches out of the equation), you have to use plain CUDA C with the SDK.
I find pure CUDA C actually easier to learn, as it gives you quite a good understanding of what is going on on the GPU. Thrust adds a lot of magic between your lines of code.
I have never used auto-parallelizing tools such as PGI Workstation, but I wouldn't advise adding even more "magic" into the equation.
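To see what that "magic" buys you: a sum of squares is one Thrust call, while the equivalent pure CUDA C would need a hand-written kernel like the one sketched earlier plus the launch and copy plumbing. A minimal sketch (the functor name is my own):

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>

// One call fuses the square and the sum; the kernels Thrust launches
// underneath are exactly the hidden "magic" between your lines of code.
struct square
{
    __host__ __device__ float operator()(float x) const { return x * x; }
};

int main()
{
    thrust::device_vector<float> v(1024, 1.0f);
    float sum_sq = thrust::transform_reduce(v.begin(), v.end(),
                                            square(), 0.0f,
                                            thrust::plus<float>());
    return sum_sq == 1024.0f ? 0 : 1;
}
```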