Parallel reduction and finding the index on CUDA

Published 2024-09-26 21:18:06


I have an array of 20K values and I am reducing it over 50 blocks with 400 threads each. num_blocks = 50 and block_size = 400.

My code looks like this:

getmax <<< num_blocks,block_size >>> (d_in, d_out1, d_indices);

__global__ void getmax(float *in1, float *out1, int *index)
{
    // Declare arrays to be in shared memory.
    __shared__ float max[threads];

    int nTotalThreads = blockDim.x;    // Total number of active threads
    float temp;
    float max_val;
    int max_index;
    int arrayIndex;

    // Calculate which element this thread reads from memory
    arrayIndex = gridDim.x*blockDim.x*blockIdx.y + blockDim.x*blockIdx.x + threadIdx.x;
    max[threadIdx.x] = in1[arrayIndex];
    max_val = max[threadIdx.x];
    max_index = blockDim.x*blockIdx.x + threadIdx.x;
    __syncthreads();

    while(nTotalThreads > 1)
    {
        int halfPoint = (nTotalThreads >> 1);
        if (threadIdx.x < halfPoint) 
        {
            temp = max[threadIdx.x + halfPoint];
            if (temp > max[threadIdx.x]) 
            {
                max[threadIdx.x] = temp;
                max_val = max[threadIdx.x];            
            }
        }
        __syncthreads();

        nTotalThreads = (nTotalThreads >> 1);    // divide by two.
    }

    if (threadIdx.x == 0)
    {
        out1[num_blocks*blockIdx.y + blockIdx.x] = max[threadIdx.x];
    }

    if(max[blockIdx.x] == max_val )
    {
        index[blockIdx.x] = max_index;    
    }
}

The problem here is that at some point nTotalThreads is not exactly a power of 2, resulting in a garbage value for the index. The array out1 gives me the maximum value in each block, which is correct and validated. But the value of the index is wrong. For example: the max value in the first block occurs at index = 40, but the kernel gives the index as 15. Similarly, the max value in the second block is at 440, but the kernel gives 416.

Any suggestions??


4 Answers

彩虹直至黑白 2024-10-03 21:18:06


It should be easy to ensure that nTotalThreads is always a power of 2.

Make the first reduction a special case that gets nTotalThreads to a power of 2. E.g., since you start with 400 threads in a block, do the first reduction with 256 threads: threads 0-199 will reduce from two values, and threads 200-255 just won't have to do a reduction in this initial step. From then on you'd be fine.
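A minimal sketch of that special-case first step, assuming a 1D grid and block_size = 400. The names getmax_pow2 and REDUCE_THREADS, and the second shared array that carries the index along, are illustrative additions rather than code from this answer; the split is also a slight variant of the one described above (threads 0-143 fold two elements each so that exactly 256 values remain and no padding values are needed, while threads 144-255 just copy one element):

#define BLOCK_SIZE     400
#define REDUCE_THREADS 256

__global__ void getmax_pow2(const float *in1, float *out1, int *index)
{
    // Shared storage for the running maxima and the input index each one came from.
    __shared__ float smax[REDUCE_THREADS];
    __shared__ int   sidx[REDUCE_THREADS];

    int base = BLOCK_SIZE * blockIdx.x;

    if (threadIdx.x < REDUCE_THREADS)
    {
        float v   = in1[base + threadIdx.x];
        int   idx = base + threadIdx.x;
        if (threadIdx.x < BLOCK_SIZE - REDUCE_THREADS)   // threads 0-143 reduce two values
        {
            float w = in1[base + threadIdx.x + REDUCE_THREADS];
            if (w > v) { v = w; idx = base + threadIdx.x + REDUCE_THREADS; }
        }
        smax[threadIdx.x] = v;
        sidx[threadIdx.x] = idx;
    }
    __syncthreads();

    // From here on the active count is always a power of 2: 128, 64, 32, ...
    for (int half = REDUCE_THREADS >> 1; half > 0; half >>= 1)
    {
        if (threadIdx.x < half && smax[threadIdx.x + half] > smax[threadIdx.x])
        {
            smax[threadIdx.x] = smax[threadIdx.x + half];
            sidx[threadIdx.x] = sidx[threadIdx.x + half];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0)
    {
        out1[blockIdx.x]  = smax[0];   // block maximum
        index[blockIdx.x] = sidx[0];   // position of that maximum in the input
    }
}

It would be launched exactly like the original: getmax_pow2 <<< num_blocks, block_size >>> (d_in, d_out1, d_indices);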

囍笑 2024-10-03 21:18:06


Are you sure you really need to handle the 'issue' of nTotalThreads not being exactly a power of 2?
It makes the code less readable, and I think it can interfere with performance too.
Anyway, if you substitute

nTotalThreads = (nTotalThreads >> 1);

with

nTotalThreads = (nTotalThreads +1 ) >> 1;

it should solve one bug concerning this 'issue'.

Francesco
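One caveat: the substitution alone still drops the middle element whenever the count is odd, because halfPoint keeps rounding down. A sketch of a drop-in replacement for the while loop in the question's kernel, where the half-point also rounds up and the read is bounds-guarded (the guard is an addition beyond Francesco's one-liner):

    while (nTotalThreads > 1)
    {
        int halfPoint = (nTotalThreads + 1) >> 1;   // round up so no element is skipped
        if (threadIdx.x < halfPoint && threadIdx.x + halfPoint < nTotalThreads)
        {
            temp = max[threadIdx.x + halfPoint];
            if (temp > max[threadIdx.x])
            {
                max[threadIdx.x] = temp;
                max_val = temp;
            }
        }
        __syncthreads();

        nTotalThreads = halfPoint;
    }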

呆萌少年 2024-10-03 21:18:06


I second Jeff's suggestion.

Take a look at the CUDA Thrust Library's reduce function. This is demonstrated to have 95+% efficiency compared with heavily hand-tuned kernels and is pretty flexible and easy to use.
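For reference, a minimal self-contained sketch of the Thrust route. It uses thrust::max_element rather than thrust::reduce, since that also returns the position of the maximum, which is what the question is after; the thrust::sequence fill is just placeholder data:

#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <thrust/sequence.h>
#include <cstdio>

int main()
{
    const int N = 20000;
    thrust::device_vector<float> d_in(N);
    thrust::sequence(d_in.begin(), d_in.end());   // placeholder data; use your real input here

    // Iterator pointing at the largest element on the device.
    thrust::device_vector<float>::iterator it =
        thrust::max_element(d_in.begin(), d_in.end());

    int   max_index = it - d_in.begin();
    float max_val   = *it;
    printf("max = %f at index %d\n", max_val, max_index);
    return 0;
}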

救赎№ 2024-10-03 21:18:06


Check my kernel. You can put your block results into an array (which can be in global memory) and get the final result from global memory.

And see how I call it in host code:

sumSeries<<<dim3(blockCount),dim3(threadsPerBlock)>>>(deviceSum,threadsPerBlock*blockCount);
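The sumSeries kernel itself isn't shown here, but the same two-stage idea applied to the question's own kernel would look roughly like this (a sketch, not the poster's code): the per-block maxima and indices land in global memory, and a tiny second pass on the host picks the overall winner.

getmax <<< num_blocks, block_size >>> (d_in, d_out1, d_indices);

float h_max[50];
int   h_idx[50];
cudaMemcpy(h_max, d_out1,    num_blocks * sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(h_idx, d_indices, num_blocks * sizeof(int),   cudaMemcpyDeviceToHost);

// Scan the 50 block results for the global maximum and its index.
float best_val = h_max[0];
int   best_idx = h_idx[0];
for (int b = 1; b < num_blocks; ++b)
{
    if (h_max[b] > best_val) { best_val = h_max[b]; best_idx = h_idx[b]; }
}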