CUDA __umul24 function: useful or not?
Is it worth replacing all multiplications with the __umul24 function in a CUDA kernel? I've read different and opposite opinions, and I still can't put together a benchmark to figure it out.
2 Answers
Just wanted to chime in with a slightly different opinion than Ashwin/fabrizioM...
If you're just trying to teach yourself CUDA, their answer is probably more or less acceptable. But if you're actually trying to deploy a production-grade app to a commercial or research setting, that sort of attitude is generally not acceptable, unless you are absolutely sure that your end users' hardware (or yours, if you're the end user) is Fermi or later.
More likely, there are many users who will be running CUDA on legacy machines and who would benefit from using compute-capability-appropriate functionality. And it's not as hard as Ashwin/fabrizioM make it out to be.
For example, in a code I'm working on, I select the multiplication at compile time based on the target architecture, as shown in the sketch below.
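A minimal sketch of that idea, using a hypothetical MULTIPLY macro keyed off __CUDA_ARCH__ (the macro and kernel names here are illustrative, not part of CUDA):

    // Hypothetical MULTIPLY macro: __CUDA_ARCH__ is defined by nvcc
    // during each device compilation pass, so every target architecture
    // gets the multiplication best suited to it from the same source.
    #if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 200)
        // Pre-Fermi (compute capability < 2.0): the integer unit is
        // 24-bit, so __umul24 is faster. Only valid when both operands
        // fit in 24 bits!
        #define MULTIPLY(a, b) __umul24((a), (b))
    #else
        // Fermi and later: the native 32-bit multiply is the fast path.
        #define MULTIPLY(a, b) ((a) * (b))
    #endif

    // Example kernel using the macro for its index arithmetic; both
    // blockIdx.x and blockDim.x comfortably fit in 24 bits.
    __global__ void scale(unsigned int *out, const unsigned int *in,
                          unsigned int factor, unsigned int n)
    {
        unsigned int i = MULTIPLY(blockIdx.x, blockDim.x) + threadIdx.x;
        if (i < n)
            out[i] = MULTIPLY(in[i], factor);
    }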
Now there IS a downside here. What is it?
Well, for any kernel that uses multiplication, you must effectively maintain two different versions of the kernel.
Is it worth it?
Well, consider: this is a trivial copy-and-paste job, and you gain efficiency, so in my opinion, yes. After all, CUDA isn't the easiest form of programming conceptually (nor is any parallel programming). If performance is NOT critical, ask yourself: why are you using CUDA?
If performance is critical, it's negligent to code lazily and either abandon legacy devices or ship less-than-optimal execution, unless you're absolutely confident you can abandon legacy support for your deployment (allowing optimal execution).
For most, it makes sense to provide legacy support, given that it's not that hard once you realize how to do it. Be aware that this also means you will need to update your code, in order to adjust to changes in future architectures.
Generally, you should note the latest compute capability your code was targeted at and when it was written, and perhaps print some sort of warning to users if their device's compute capability is beyond what your latest implementation is optimized for.
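A minimal sketch of such a startup check (function name and the tuned-for values are illustrative assumptions):

    #include <cstdio>
    #include <cuda_runtime.h>

    const int TUNED_MAJOR = 2;  // assumption: code last tuned for CC 2.0
    const int TUNED_MINOR = 0;

    // Warn the user if the device is newer than what the code was
    // last optimized for.
    void warnIfNewerThanTuned(int device)
    {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
            return;
        if (prop.major > TUNED_MAJOR ||
            (prop.major == TUNED_MAJOR && prop.minor > TUNED_MINOR))
        {
            fprintf(stderr,
                    "Warning: compute capability %d.%d exceeds the %d.%d "
                    "this code was optimized for.\n",
                    prop.major, prop.minor, TUNED_MAJOR, TUNED_MINOR);
        }
    }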
Only on devices with an architecture prior to Fermi, that is, with compute capability below 2.0, where the integer arithmetic unit is 24-bit.
On CUDA devices with compute capability >= 2.0, the architecture is 32-bit, and __umul24 will be slower instead of faster. The reason is that it has to emulate the 24-bit operation on the 32-bit architecture.
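For context, __umul24(a, b) returns the low 32 bits of the product of the low 24 bits of each operand, so on newer hardware it amounts to extra masking around an ordinary 32-bit multiply. A sketch of that equivalence (kernel name is illustrative):

    // __umul24 multiplies the low 24 bits of each operand and keeps
    // the low 32 bits of the result. On compute capability >= 2.0,
    // this must be emulated with a full 32-bit multiply plus masking,
    // hence the potential slowdown.
    __global__ void umul24Semantics(unsigned int a, unsigned int b,
                                    unsigned int *out)
    {
        unsigned int emulated = (a & 0xFFFFFFu) * (b & 0xFFFFFFu);
        out[0] = __umul24(a, b);  // same value as 'emulated'
        out[1] = emulated;
    }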
The question now is: is it worth the effort for the speed gain? Probably not.