Optimizing a CUDA kernel with respect to registers

Posted on 2024-11-08 13:03:46

I'm using the CUDA Occupancy Calculator to try to optimize my CUDA kernel. Currently I'm using 34 registers and no shared memory, so the maximum occupancy is 63% at 310 threads per block. If I could somehow reduce the register count to 20 or below (e.g. by passing kernel parameters via shared memory), I could get an occupancy of 100%. Is this a good way to go, or would you advise a different optimization path?

I'm also wondering whether there is a newer version of the occupancy calculator that covers compute capability 2.1.
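
The 63% figure can be reproduced by hand. The sketch below is not the Occupancy Calculator itself, only a simplified model of its register-limited case. The per-SM limits in `SmLimits` (48 resident warps, 8 resident blocks, 32768 registers, per-warp register allocation in units of 64) are assumptions taken from the commonly published compute capability 2.x figures, not from the post, and the names `SmLimits`, `active_warps`, and `occupancy` are made up for illustration.

```cpp
#include <algorithm>

// ASSUMPTION: per-SM limits for compute capability 2.x (Fermi).
// These values are not stated in the post; adjust them for other devices.
struct SmLimits {
    int warp_size = 32;
    int max_warps = 48;       // 1536 resident threads / 32
    int max_blocks = 8;       // resident blocks per SM
    int registers = 32768;    // 32-bit registers per SM
    int reg_alloc_unit = 64;  // registers are allocated per warp in units of 64
};

// Round x up to the next multiple of unit.
int round_up(int x, int unit) { return (x + unit - 1) / unit * unit; }

// Resident warps per SM for a kernel that uses no shared memory.
// Expects regs_per_thread >= 1 and threads_per_block >= 1.
int active_warps(int regs_per_thread, int threads_per_block,
                 const SmLimits& sm = SmLimits{}) {
    // A partially filled warp still occupies a whole warp slot.
    int warps_per_block = (threads_per_block + sm.warp_size - 1) / sm.warp_size;
    int regs_per_warp = round_up(regs_per_thread * sm.warp_size, sm.reg_alloc_unit);
    int regs_per_block = regs_per_warp * warps_per_block;

    int blocks_by_regs = sm.registers / regs_per_block;
    int blocks_by_warps = sm.max_warps / warps_per_block;
    int blocks = std::min({blocks_by_regs, blocks_by_warps, sm.max_blocks});
    return blocks * warps_per_block;
}

// Occupancy = resident warps / maximum resident warps per SM.
double occupancy(int regs_per_thread, int threads_per_block,
                 const SmLimits& sm = SmLimits{}) {
    return static_cast<double>(active_warps(regs_per_thread, threads_per_block, sm)) /
           sm.max_warps;
}
```

Under these assumed limits, `occupancy(34, 310)` evaluates to 0.625: each block needs 10 warps and 10880 registers, so only 3 blocks (30 of 48 warps) fit on an SM.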

Comments (2)

酷到爆炸 2024-11-15 13:03:46

Some points to consider:

  1. 320 threads per block will give the same occupancy as 310, because occupancy is defined as active warps / maximum warps per SM, and the warp size is always 32 threads. You should never use a block size that is not a whole multiple of 32. That just wastes cores and cycles.
  2. Kernel parameters are passed in constant memory on your compute 2.1 device, so they have no effect on occupancy or register usage.
  3. The GPU design has a pipeline latency of about 21 cycles. So for a Fermi GPU, you need about 43% occupancy to cover all of the internal scheduling latency. Once that is done, you may find that there is relatively little benefit in trying to achieve higher occupancy.
  4. Striving for 100% occupancy is rarely a constructive optimization goal. If you have not done so already, I highly recommend looking over Vasily Volkov's presentation from GTC 2010, "Better Performance at Lower Occupancy", where he shows all sorts of surprising results, such as code hitting 85% of peak memory bandwidth at 8% occupancy.
  5. The newest occupancy calculator doesn't cover compute 2.1, but the effective occupancy rules for compute 2.0 apply to 2.1 devices too. The extra cores in the compute 2.1 multiprocessor come into play via instruction-level parallelism and what is almost out-of-order execution. That really doesn't change the occupancy characteristics of the device at all compared to compute 2.0 devices.

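
The arithmetic behind points 1 and 3 can be sketched in a few lines. The function names are hypothetical, and the limit of 48 resident warps per SM is an assumed Fermi figure rather than something stated in the answer. Reading the 43% in point 3 as "about one resident warp per cycle of pipeline latency" is one plausible interpretation, not the answerer's stated derivation.

```cpp
// Warps needed to hold one block. A partially filled warp still
// occupies a whole warp slot on the SM.
int warps_per_block(int threads_per_block, int warp_size = 32) {
    return (threads_per_block + warp_size - 1) / warp_size;
}

// Lanes in the block's last warp that have no thread assigned to them.
int idle_lanes(int threads_per_block, int warp_size = 32) {
    return warps_per_block(threads_per_block, warp_size) * warp_size - threads_per_block;
}

// ASSUMPTION: roughly one resident warp per cycle of pipeline latency is
// enough to hide that latency, and an SM holds at most 48 warps (Fermi).
double occupancy_to_cover(int latency_cycles, int max_warps_per_sm = 48) {
    return static_cast<double>(latency_cycles) / max_warps_per_sm;
}
```

Both `warps_per_block(310)` and `warps_per_block(320)` return 10, so the two block sizes occupy the same number of warp slots, while `idle_lanes(310)` shows that 10 lanes of the last warp do no work. `occupancy_to_cover(21)` gives 0.4375, in line with the roughly 43% quoted above.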
柠檬色的秋千 2024-11-15 13:03:46

talonmies is correct: occupancy is overrated.

Vasily Volkov gave a great presentation at GTC 2010 on this topic: "Better Performance at Lower Occupancy."

http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
