CUDA __umul24 function: useful or not?
Is it worth replacing all multiplications with the __umul24 function in a CUDA kernel? I've read different and opposite opinions, and I still can't put together a benchmark to figure it out.
2 Answers
Just wanted to chime in with a slightly different opinion than Ashwin/fabrizioM...
If you're just trying to teach yourself CUDA, their answer is probably more or less acceptable. But if you're actually trying to deploy a production-grade app to a commercial or research setting, that sort of attitude is generally not acceptable, unless you are absolutely sure that your end users' hardware (or yours, if you're the end user) is Fermi or later.
More likely, there are many users who will be running CUDA on legacy machines and who would benefit from using compute-capability-appropriate functionality. And it's not as hard as Ashwin/fabrizioM make it out to be.
For example, in a code I'm working on, I select the multiplication at compile time based on the target architecture, as shown in the sketch below.
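A minimal sketch of that idea, using a hypothetical MULTIPLY macro keyed off __CUDA_ARCH__ (the macro and kernel names here are illustrative, not part of CUDA):

    // Hypothetical MULTIPLY macro: __CUDA_ARCH__ is defined by nvcc
    // during each device compilation pass, so every target architecture
    // gets the multiplication best suited to it from the same source.
    #if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 200)
        // Pre-Fermi (compute capability < 2.0): the integer unit is
        // 24-bit, so __umul24 is faster. Only valid when both operands
        // fit in 24 bits!
        #define MULTIPLY(a, b) __umul24((a), (b))
    #else
        // Fermi and later: the native 32-bit multiply is the fast path.
        #define MULTIPLY(a, b) ((a) * (b))
    #endif

    // Example kernel using the macro for its index arithmetic; both
    // blockIdx.x and blockDim.x comfortably fit in 24 bits.
    __global__ void scale(unsigned int *out, const unsigned int *in,
                          unsigned int factor, unsigned int n)
    {
        unsigned int i = MULTIPLY(blockIdx.x, blockDim.x) + threadIdx.x;
        if (i < n)
            out[i] = MULTIPLY(in[i], factor);
    }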
Now there IS a downside here. What is it?
Well, for any kernel that uses multiplication, you must effectively maintain two different versions of the kernel.
Is it worth it?
Well, consider: this is a trivial copy-and-paste job, and you gain efficiency, so in my opinion, yes. After all, CUDA isn't the easiest form of programming conceptually (nor is any parallel programming). If performance is NOT critical, ask yourself: why are you using CUDA?
If performance is critical, it's negligent to code lazily and either abandon legacy devices or ship less-than-optimal execution, unless you're absolutely confident you can abandon legacy support for your deployment (allowing optimal execution).
For most, it makes sense to provide legacy support, given that it's not that hard once you realize how to do it. Be aware that this also means you will need to update your code, in order to adjust to changes in future architectures.
Generally, you should note the latest compute capability your code was targeted at and when it was written, and perhaps print some sort of warning to users if their device's compute capability is beyond what your latest implementation is optimized for.
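A minimal sketch of such a startup check (function name and the tuned-for values are illustrative assumptions):

    #include <cstdio>
    #include <cuda_runtime.h>

    const int TUNED_MAJOR = 2;  // assumption: code last tuned for CC 2.0
    const int TUNED_MINOR = 0;

    // Warn the user if the device is newer than what the code was
    // last optimized for.
    void warnIfNewerThanTuned(int device)
    {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
            return;
        if (prop.major > TUNED_MAJOR ||
            (prop.major == TUNED_MAJOR && prop.minor > TUNED_MINOR))
        {
            fprintf(stderr,
                    "Warning: compute capability %d.%d exceeds the %d.%d "
                    "this code was optimized for.\n",
                    prop.major, prop.minor, TUNED_MAJOR, TUNED_MINOR);
        }
    }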
Only on devices with an architecture prior to Fermi, that is, with compute capability below 2.0, where the integer arithmetic unit is 24-bit.
On CUDA devices with compute capability >= 2.0, the architecture is 32-bit, and __umul24 will be slower instead of faster. The reason is that it has to emulate the 24-bit operation on the 32-bit architecture.
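For context, __umul24(a, b) returns the low 32 bits of the product of the low 24 bits of each operand, so on newer hardware it amounts to extra masking around an ordinary 32-bit multiply. A sketch of that equivalence (kernel name is illustrative):

    // __umul24 multiplies the low 24 bits of each operand and keeps
    // the low 32 bits of the result. On compute capability >= 2.0,
    // this must be emulated with a full 32-bit multiply plus masking,
    // hence the potential slowdown.
    __global__ void umul24Semantics(unsigned int a, unsigned int b,
                                    unsigned int *out)
    {
        unsigned int emulated = (a & 0xFFFFFFu) * (b & 0xFFFFFFu);
        out[0] = __umul24(a, b);  // same value as 'emulated'
        out[1] = emulated;
    }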
The question now is: is it worth the effort for the speed gain? Probably not.