Terminating the inactive threads in a reduction
Most reductions I've ever seen look like:
for( i = N; i > 0; i /= 2 ) { if( tid < i ) assign-shared; __syncthreads(); } if( tid == 0 ) copy-value-to-global;
I've just reversed that to:
for( i = N; i > 0; i /= 2 ) { if( tid >= i ) return; assign-shared; __syncthreads(); } copy-value-to-global;
and noticed a substantial performance benefit. Is there any drawback to having the threads that are no longer involved in the reduction return early?
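Expanded into complete (hypothetical) kernels, the two patterns might look like the sketch below. Assumptions not in the question: a power-of-two block size N, one float per thread in shared memory, a sum reduction, and the loop started at N/2 because the initial shared-memory store covers the first step; all names are illustrative.

```cuda
#define N 256  // assumed power-of-two block size

// Pattern 1: all threads stay alive; only the active ones do work.
__global__ void reduceKeepAll(const float *in, float *out)
{
    __shared__ float sdata[N];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * N + tid];
    __syncthreads();

    for (unsigned int i = N / 2; i > 0; i /= 2) {
        if (tid < i)
            sdata[tid] += sdata[tid + i];  // assign-shared
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];        // copy-value-to-global
}

// Pattern 2: threads no longer involved in the reduction return early.
__global__ void reduceEarlyExit(const float *in, float *out)
{
    __shared__ float sdata[N];
    unsigned int tid = threadIdx.x;
    sdata[tid] = in[blockIdx.x * N + tid];
    __syncthreads();

    for (unsigned int i = N / 2; i > 0; i /= 2) {
        if (tid >= i)
            return;                        // retired threads exit the kernel
        sdata[tid] += sdata[tid + i];      // assign-shared
        __syncthreads();                   // only surviving threads get here
    }
    out[blockIdx.x] = sdata[0];            // only tid == 0 reaches this line
}
```

In pattern 2, whether `__syncthreads()` behaves correctly after some threads have exited is exactly what the question (and the answers below) is about.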
Since you're already performing an if statement with your original code, I don't see any drawback. If the results of your if statement did not have spatial locality (generally the same result across the block), you might not see any speedup. Also, the speedup may depend on the capabilities of your device: earlier CUDA devices may not give you the performance enhancement.
The second code segment provides better performance because the retired warps no longer need to come back and execute the branch check.
Ideally, in the second case you would be retiring one warp per iteration, reducing the load on the GPU.
dolan, in his comment above, raises the issue that the scheme proposed by William Pursell is going to deadlock, according to "Can I use __syncthreads() after having dropped threads?". Concerning this issue, I would say that, according to "conditional syncthreads & deadlock (or not)", the code will not deadlock on most GPUs, since they support early exit: on those GPUs the hardware maintains an active thread count for each block, and that count, rather than the block's initial thread count, is used for barrier synchronization.

I have considered the reduce4 CUDA SDK example and modified it according to the OP's question. Namely, I'm comparing the two __global__ functions:

ORIGINAL

MODIFIED

I have checked that the modified code does not deadlock on a GT210, a GT540M, and a Kepler K20c. However, on the Kepler card the speedup of the modified version is not that relevant (times in ms):

I haven't checked the timings for other architectures, but the risk of getting stuck in a deadlock on some GPUs is probably not worth the attainable speedup (provided that the attainable speedup remains of the same order of magnitude).
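The ORIGINAL and MODIFIED kernels compared above are not reproduced in this copy of the thread. A minimal sketch of what they might look like, loosely modeled on the SDK's reduce4 kernel (first add performed during the global load, blockSize assumed a power of two; all names are illustrative, not the answerer's actual code):

```cuda
// ORIGINAL: standard reduce4-style tree reduction; all threads stay alive.
template <unsigned int blockSize>
__global__ void reduce4_original(const float *g_idata, float *g_odata)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize * 2) + tid;
    sdata[tid] = g_idata[i] + g_idata[i + blockSize];  // first add on load
    __syncthreads();

    for (unsigned int s = blockSize / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}

// MODIFIED: threads that have finished their part return early,
// as in the OP's reversed pattern.
template <unsigned int blockSize>
__global__ void reduce4_modified(const float *g_idata, float *g_odata)
{
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockSize * 2) + tid;
    sdata[tid] = g_idata[i] + g_idata[i + blockSize];  // first add on load
    __syncthreads();

    for (unsigned int s = blockSize / 2; s > 0; s >>= 1) {
        if (tid >= s)
            return;                       // early exit for retired threads
        sdata[tid] += sdata[tid + s];
        __syncthreads();                  // executed only by surviving threads
    }
    g_odata[blockIdx.x] = sdata[0];       // only tid == 0 survives the loop
}
```

Whether the `__syncthreads()` in the MODIFIED version is safe after threads have returned depends on the hardware behavior discussed above; it is technically undefined behavior per the CUDA programming model.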