Terminating inactive threads early during a reduction

Posted on 2024-10-15 05:21:51


Most reductions I've ever seen look like:

for( i = N; i > 0; i /= 2 ) {
    if( tid < i )
        assign-shared;
    __syncthreads();
}
if( tid == 0 )
    copy-value-to-global;

I've just reversed that to:

for( i = N; i > 0; i /= 2 ) {
    if( tid >= i )
        return;
    assign-shared;
    __syncthreads();
}
copy-value-to-global;

and noticed a substantial performance benefit. Is there any drawback to having the threads that are no longer involved in the reduction return early?
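To convince myself the transformation is logically equivalent, here is a CPU-side sketch (my own illustration, not the actual CUDA kernels): it simulates one block of "threads" running the tree reduction once with the original `tid < i` write guard and once with the reversed `tid >= i` early exit, and checks both produce the same result.

```cpp
#include <cassert>
#include <vector>

// Original shape: every thread re-evaluates the guard each iteration;
// only threads with tid < s do work. (On the GPU, a __syncthreads()
// barrier would follow each iteration.)
int reduce_original(std::vector<int> sdata) {
    int n = (int)sdata.size();
    for (int s = n / 2; s > 0; s /= 2) {
        for (int tid = 0; tid < n; ++tid)
            if (tid < s) sdata[tid] += sdata[tid + s];
    }
    return sdata[0];
}

// Reversed shape: threads with tid >= s return and never come back;
// only the surviving threads execute subsequent iterations.
int reduce_early_exit(std::vector<int> sdata) {
    int n = (int)sdata.size();
    int alive = n;                            // threads that have not returned
    for (int s = n / 2; s > 0; s /= 2) {
        if (alive > s) alive = s;             // tid >= s threads exit here
        for (int tid = 0; tid < alive; ++tid)
            sdata[tid] += sdata[tid + s];
    }
    return sdata[0];
}
```

Both variants perform the same additions for the active threads; the only difference is whether retired threads keep re-entering the loop to fail the guard.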


3 Answers

回忆凄美了谁 2024-10-22 05:21:51


Since you're already performing an if statement with your original code, I don't see any drawback.

If the results of your if statement did not have spatial locality (generally the same result across the block), you may not see any speedup. Also, the speedup may be dependent on the capabilities of your device: earlier CUDA devices may not give you the performance enhancement.

掩饰不了的爱 2024-10-22 05:21:51


The second code segment provides better performance as the unused warps do not need to come back and perform a branching check.

Ideally, in the second case you would be retiring one warp per iteration reducing the load on the GPU.
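As a rough tally (my own arithmetic, not from the answer): assuming a 256-thread block and 32-thread warps, the number of warps that still contain an active thread at each iteration is just `ceil(s / warpSize)`.

```cpp
#include <cassert>

// For step size s in the loop "for (s = blockDim.x/2; s > 0; s >>= 1)",
// count the warps that still contain at least one active thread (tid < s),
// assuming the usual 32-thread warp.
int active_warps(int s) {
    return (s + 31) / 32;   // ceil(s / 32)
}
```

So for a 256-thread block the active warp count drops 4, 2, 1, 1, ... as `s` halves. Note the subtlety: with the `tid < s` guard, retired warps still re-enter each iteration to evaluate the guard and reach the barrier, whereas with the early return they are gone for good.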

孤檠 2024-10-22 05:21:51


In a comment above, dolan raises the concern that the scheme proposed by William Pursell will deadlock, per "Can I use __syncthreads() after having dropped threads?". On this point I would say that, according to "conditional syncthreads & deadlock (or not)", the code will not deadlock on most GPUs, since they support early exit: on those GPUs the hardware maintains an active-thread count for each block, and that count, rather than the block's initial thread count, is used for barrier synchronization.
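A toy model of that distinction (my own illustration, not the actual hardware): a barrier keyed to the launch-time thread count can never be satisfied once a thread has returned, while one keyed to the live-thread count tolerates early exits.

```cpp
#include <cassert>

// `arrived` threads are waiting at the barrier, `exited` threads have
// returned from the kernel, and the block was launched with `launched`
// threads. Can the barrier release?

// Barrier keyed to the launch-time count: exited threads still count
// against it, so any early return wedges the barrier.
bool barrier_passes_fixed_count(int arrived, int exited, int launched) {
    (void)exited;
    return arrived == launched;
}

// Barrier keyed to the live-thread count: exited threads are subtracted,
// so the remaining threads can still synchronize.
bool barrier_passes_live_count(int arrived, int exited, int launched) {
    return arrived == launched - exited;
}
```

With 32 of 64 threads returned, the 32 survivors satisfy the live-count barrier but would hang forever on the fixed-count one.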

I have considered the reduce4 CUDA SDK example and modified it according to the OP's question. Namely, I'm comparing the two __global__ functions:

ORIGINAL

template <class T>
__global__ void reduce4(T *g_idata, T *g_odata, unsigned int N)
{
    extern __shared__ T sdata[];

    unsigned int tid    = threadIdx.x;                              // Local thread index
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;       // Global thread index - Fictitiously double the block dimension

    // --- Performs the first level of reduction in registers when reading from global memory. 
    T mySum = (i < N) ? g_idata[i] : 0;
    if (i + blockDim.x < N) mySum += g_idata[i+blockDim.x];
    sdata[tid] = mySum;

    // --- Before going further, we have to make sure that all the shared memory loads have been completed
    __syncthreads();

    // --- Reduction in shared memory. Only half of the threads contribute to reduction.
    for (unsigned int s=blockDim.x/2; s>32; s>>=1)
    {
        if (tid < s) { sdata[tid] = mySum = mySum + sdata[tid + s]; }
        // --- At the end of each iteration loop, we have to make sure that all memory operations have been completed
        __syncthreads();
    }

    // --- Single warp reduction by loop unrolling. Assuming blockDim.x >64
    if (tid < 32) {
        sdata[tid] = mySum = mySum + sdata[tid + 32]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid + 16]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  8]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  4]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  2]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  1]; __syncthreads();
    }

    // --- Write result for this block to global memory. At the end of the kernel, global memory will contain the results for the summations of
    //     individual blocks
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

MODIFIED

template <class T>
__global__ void reduce4_deadlock_test(T *g_idata, T *g_odata, unsigned int N)
{
    extern __shared__ T sdata[];

    unsigned int tid    = threadIdx.x;                              // Local thread index
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;       // Global thread index - Fictitiously double the block dimension

    // --- Performs the first level of reduction in registers when reading from global memory. 
    T mySum = (i < N) ? g_idata[i] : 0;
    if (i + blockDim.x < N) mySum += g_idata[i+blockDim.x];
    sdata[tid] = mySum;

    // --- Before going further, we have to make sure that all the shared memory loads have been completed
    __syncthreads();

    // --- Reduction in shared memory. Only half of the threads contribute to reduction.
    for (unsigned int s=blockDim.x/2; s>32; s>>=1)
    {
        if (tid >= s) return;
        sdata[tid] = mySum = mySum + sdata[tid + s];
        // --- At the end of each iteration loop, we have to make sure that all memory operations have been completed
        __syncthreads();
    }

    // --- Single warp reduction by loop unrolling. Assuming blockDim.x >64
    if (tid < 32) {
        sdata[tid] = mySum = mySum + sdata[tid + 32]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid + 16]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  8]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  4]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  2]; __syncthreads();
        sdata[tid] = mySum = mySum + sdata[tid +  1]; __syncthreads();
    }

    // --- Write result for this block to global memory. At the end of the kernel, global memory will contain the results for the summations of
    //     individual blocks
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

I have checked that the modified code does not deadlock on a GT210, a GT540M, and a Kepler K20c. However, on the Kepler card the speedup of the modified version is not that significant (times in ms):

N          Original          Modified
131072     0.021             0.019
262144     0.030             0.032
524288     0.052             0.052
1048576    0.091             0.080
2097152    0.165             0.146
4194304    0.323             0.286
8388608    0.637             0.555
16777216   1.264             1.122
33554432   2.514             2.189

I haven't checked the timings on other architectures, but the risk of getting stuck in a deadlock on some GPUs is probably not worth the achievable speedup (provided the achievable speedup stays in the same order of magnitude).
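Reading the table above as ratios makes the point concrete (a trivial helper of my own; the numbers are taken from the table):

```cpp
#include <cassert>
#include <cmath>

// Speedup of the modified kernel over the original for one problem
// size, given the two K20c timings in milliseconds.
double speedup(double original_ms, double modified_ms) {
    return original_ms / modified_ms;
}
```

For the large sizes the gain is roughly 13-15% (e.g. 2.514/2.189 ≈ 1.15), while at N = 262144 the modified version is actually slightly slower (0.030/0.032 < 1), which supports treating the early-return trick as a modest, architecture-dependent optimization rather than a guaranteed win.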
