Improve kernel performance by increasing occupancy?
Here is the Compute Visual Profiler output for my kernel on a GT 440:
- Kernel details: Grid size: [100 1 1], Block size: [256 1 1]
- Register Ratio: 0.84375 ( 27648 / 32768 ) [35 registers per thread]
- Shared Memory Ratio: 0.336914 ( 16560 / 49152 ) [5520 bytes per Block]
- Active Blocks per SM: 3 (Maximum Active Blocks per SM: 8)
- Active threads per SM: 768 (Maximum Active threads per SM: 1536)
- Potential Occupancy: 0.5 ( 24 / 48 )
- Occupancy limiting factor: Registers
Please pay attention to the bullets marked in bold. The kernel execution time is 121195 us.
I reduced the number of registers per thread by moving some local variables to shared memory. The Compute Visual Profiler output became:
- Kernel details: Grid size: [100 1 1], Block size: [256 1 1]
- Register Ratio: 1 ( 32768 / 32768 ) [30 registers per thread]
- Shared Memory Ratio: 0.451823 ( 22208 / 49152 ) [5552 bytes per Block]
- Active Blocks per SM: 4 (Maximum Active Blocks per SM: 8)
- Active threads per SM: 1024 (Maximum Active threads per SM: 1536)
- Potential Occupancy: 0.666667 ( 32 / 48 )
- Occupancy limiting factor: Registers
Hence, now 4 blocks are executed simultaneously on a single SM, versus 3 blocks in the previous version. However, the execution time is 115756 us, which is almost the same! Why? Aren't the blocks totally independent, being executed on different CUDA cores?
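For illustration only (the actual kernel is not shown in the question), the kind of change described, moving a per-thread local variable from a register into a statically allocated shared-memory array with one slot per thread, might look roughly like the sketch below; the kernel bodies and names are made up for this example:

```cuda
// "Before": a per-thread scratch value is held in a register.
__global__ void kernel_before(const float* in, float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float acc = 0.0f;                     // lives in a register
    for (int i = 0; i < 16; ++i)
        acc += in[tid] * (float)(i + 1);
    out[tid] = acc;
}

// "After": the same scratch value is kept in shared memory, one slot per
// thread of the 256-thread block, trading register pressure for
// shared-memory usage and higher access latency.
__global__ void kernel_after(const float* in, float* out, int n)
{
    __shared__ float acc[256];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    acc[threadIdx.x] = 0.0f;
    for (int i = 0; i < 16; ++i)
        acc[threadIdx.x] += in[tid] * (float)(i + 1);
    out[tid] = acc[threadIdx.x];
}
```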
You are implicitly assuming that higher occupancy automatically translates into higher performance. That is most often not the case.
The NVIDIA architecture needs a certain number of active warps per SM in order to hide the instruction pipeline latency of the GPU. On your Fermi-based card, that requirement translates to a minimum occupancy of roughly 30%. Aiming for occupancies above that minimum will not necessarily result in higher throughput, because the latency bottleneck may have moved to another part of the GPU. Your entry-level GPU does not have a lot of memory bandwidth, and it is quite possible that 3 blocks per SM are enough to make your code memory-bandwidth limited, in which case increasing the number of blocks will have no effect on performance (it might even drop because of increased memory controller contention and cache misses). Further, you said you spilled variables to shared memory in order to reduce the register footprint of the kernel. On Fermi, shared memory only has about 1000 GB/s of bandwidth, compared to roughly 8000 GB/s for registers (see the link below for the microbenchmarking results which demonstrate this). So you have moved variables to slower memory, which may also hurt performance, offsetting any benefit that the higher occupancy affords.
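As a side note, if you want to check the "active blocks per SM" figure outside the profiler, newer CUDA toolkits (more recent than the Compute Visual Profiler used above) expose an occupancy API. A minimal sketch, with a placeholder kernel standing in for the questioner's kernel, could look like this:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the questioner's kernel (not shown above).
__global__ void my_kernel(float* data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] *= 2.0f;
}

int main()
{
    const int blockSize = 256;

    // Ask the runtime how many blocks of this kernel can be resident per SM,
    // given its register and shared-memory usage.
    int activeBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&activeBlocksPerSM,
                                                  my_kernel, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    float occupancy = (float)(activeBlocksPerSM * blockSize)
                    / (float)prop.maxThreadsPerMultiProcessor;

    printf("Active blocks per SM: %d, theoretical occupancy: %.2f\n",
           activeBlocksPerSM, occupancy);
    return 0;
}
```

On the numbers in the question, 3 blocks x 256 threads = 768 threads = 24 warps out of a maximum of 48, i.e. 0.5 occupancy, which matches the profiler's report.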
If you have not already seen it, I highly recommend Vasily Volkov's GTC 2010 presentation, "Better Performance at Lower Occupancy" (pdf). It shows how exploiting instruction-level parallelism can push GPU throughput to very high levels at very low occupancy.
talonmies has already answered your question, so I just want to share some code inspired by the first part of V. Volkov's presentation mentioned in the answer above.
This is the code:
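(The original listing does not appear in this copy of the answer. The sketch below is not the author's code, but it follows the same idea from Volkov's talk: each thread keeps several independent accumulators, so that successive multiply-adds do not depend on each other and can overlap in the pipeline even when very few warps are resident. The kernel name, launch configuration, and constants are all illustrative assumptions.)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N_ITERATIONS 8192

// Each thread maintains ILP independent accumulators; the FMAs within one
// iteration do not depend on each other, so the hardware can overlap their
// latencies even at low occupancy.
template <int ILP>
__global__ void ilp_kernel(float* out, float a, float b)
{
    float acc[ILP];
#pragma unroll
    for (int j = 0; j < ILP; ++j)
        acc[j] = b + 0.001f * j;

    for (int i = 0; i < N_ITERATIONS; ++i) {
#pragma unroll
        for (int j = 0; j < ILP; ++j)
            acc[j] = a * acc[j] + b;     // independent chains of FMAs
    }

    float sum = 0.0f;
#pragma unroll
    for (int j = 0; j < ILP; ++j)
        sum += acc[j];
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}

// Times one launch of ilp_kernel<ILP> with a deliberately small grid
// (low occupancy) and prints the elapsed time.
template <int ILP>
void run(float* d_out, int blocks, int threads)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    ilp_kernel<ILP><<<blocks, threads>>>(d_out, 1.0001f, 0.5f);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("ILP=%d: %.3f ms\n", ILP, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

int main()
{
    const int blocks = 4, threads = 192;   // few resident warps -> low occupancy
    float* d_out;
    cudaMalloc((void**)&d_out, blocks * threads * sizeof(float));

    // Warm-up launch so context creation does not pollute the first timing.
    ilp_kernel<1><<<blocks, threads>>>(d_out, 1.0001f, 0.5f);
    cudaDeviceSynchronize();

    run<1>(d_out, blocks, threads);   // 1 independent chain per thread
    run<2>(d_out, blocks, threads);   // 2x the arithmetic work per thread
    run<4>(d_out, blocks, threads);   // 4x the arithmetic work per thread

    cudaFree(d_out);
    return 0;
}
```

If the ILP=2 and ILP=4 launches take roughly the same time as ILP=1 despite doing two and four times the arithmetic, the extra independent work is being absorbed by pipeline slots that would otherwise sit idle, which is exactly the effect described above.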
On my GeForce GT540M, the results show that kernels with lower occupancy can still exhibit high performance if Instruction Level Parallelism (ILP) is exploited.