Parallel reduction and finding the index on CUDA
I have an array of 20K values and I am reducing it over 50 blocks with 400 threads each. num_blocks = 50 and block_size = 400.
My code looks like this:
getmax <<< num_blocks,block_size >>> (d_in, d_out1, d_indices);
__global__ void getmax(float *in1, float *out1, int *index)
{
    // Declare arrays to be in shared memory.
    __shared__ float max[threads];

    int nTotalThreads = blockDim.x;   // Total number of active threads
    float temp;
    float max_val;
    int max_index;
    int arrayIndex;

    // Calculate which element this thread reads from memory
    arrayIndex = gridDim.x*blockDim.x*blockIdx.y + blockDim.x*blockIdx.x + threadIdx.x;
    max[threadIdx.x] = in1[arrayIndex];
    max_val = max[threadIdx.x];
    max_index = blockDim.x*blockIdx.x + threadIdx.x;
    __syncthreads();

    while (nTotalThreads > 1)
    {
        int halfPoint = (nTotalThreads >> 1);
        if (threadIdx.x < halfPoint)
        {
            temp = max[threadIdx.x + halfPoint];
            if (temp > max[threadIdx.x])
            {
                max[threadIdx.x] = temp;
                max_val = max[threadIdx.x];
            }
        }
        __syncthreads();
        nTotalThreads = (nTotalThreads >> 1);   // divide by two.
    }

    if (threadIdx.x == 0)
    {
        out1[num_blocks*blockIdx.y + blockIdx.x] = max[threadIdx.x];
    }

    if (max[blockIdx.x] == max_val)
    {
        index[blockIdx.x] = max_index;
    }
}
The problem here is that at some point "nTotalThreads" is not exactly a power of 2, resulting in garbage values for the index. The array out1 gives me the maximum value in each block, which is correct and validated. But the value of the index is wrong. For example: the max value in the first block occurs at index = 40, but the kernel gives the index as 15. Similarly, the max in the second block is at index 440, but the kernel gives 416.
Any suggestions??
4 Answers
It should be easy to ensure that nTotalThreads is always a power of 2.
Make the first reduction a special case that gets nTotalThreads to a power of 2. E.g., since you start with 400 values in a block, do the first step with 256 threads: threads 0-143 each reduce two values (element i and element i + 256), while threads 144-255 just keep their own value in this initial step. That leaves 256 candidates, so from then on every step halves cleanly and you'd be fine.
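A minimal sketch of that special-case first step, assuming the question's 400-thread blocks and a 1D grid. This version also carries the index through shared memory alongside the value, which sidesteps the separate max_val/max_index bookkeeping in the question; the names sMax and sIdx are made up for illustration.

__global__ void getmax(const float *in1, float *out1, int *index)
{
    __shared__ float sMax[400];   // one slot per element handled by this block
    __shared__ int   sIdx[400];   // global index travelling with each value

    int tid = threadIdx.x;
    int arrayIndex = blockDim.x * blockIdx.x + tid;   // 1D grid, 400 threads per block

    sMax[tid] = in1[arrayIndex];
    sIdx[tid] = arrayIndex;
    __syncthreads();

    // Special-case first step: fold elements 256..399 onto 0..143,
    // leaving 256 candidates, which is a power of 2.
    if (tid < 144 && sMax[tid + 256] > sMax[tid])
    {
        sMax[tid] = sMax[tid + 256];
        sIdx[tid] = sIdx[tid + 256];
    }
    __syncthreads();

    // Ordinary power-of-2 tree reduction from 256 downwards.
    for (int nTotalThreads = 256; nTotalThreads > 1; nTotalThreads >>= 1)
    {
        int halfPoint = nTotalThreads >> 1;
        if (tid < halfPoint && sMax[tid + halfPoint] > sMax[tid])
        {
            sMax[tid] = sMax[tid + halfPoint];
            sIdx[tid] = sIdx[tid + halfPoint];
        }
        __syncthreads();
    }

    if (tid == 0)
    {
        out1[blockIdx.x]  = sMax[0];   // block maximum
        index[blockIdx.x] = sIdx[0];   // global index of that maximum
    }
}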
Are you sure you really need to handle the 'issue' of "nTotalThreads" not being exactly a power of 2?
It makes the code less readable and I think it can interfere with the performance too.
Anyway if you substitute
nTotalThreads = (nTotalThreads >> 1);
with
nTotalThreads = (nTotalThreads + 1) >> 1;
it should solve one bug concerning this 'issue'.
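For concreteness, a sketch of the question's reduction loop written for a non-power-of-2 count, using the same shared array max. This rounds halfPoint up and adds a bounds check so the odd leftover element is never dropped, which goes slightly beyond the one-line substitution above.

    // Tree reduction that tolerates a non-power-of-2 nTotalThreads.
    int nTotalThreads = blockDim.x;                 // e.g. 400
    while (nTotalThreads > 1)
    {
        int halfPoint = (nTotalThreads + 1) >> 1;   // round up: ceil(n / 2)
        if (threadIdx.x < halfPoint &&
            threadIdx.x + halfPoint < nTotalThreads)   // guard the odd leftover slot
        {
            float temp = max[threadIdx.x + halfPoint];
            if (temp > max[threadIdx.x])
            {
                max[threadIdx.x] = temp;
            }
        }
        __syncthreads();
        nTotalThreads = halfPoint;
    }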
Francesco
Second Jeff's suggestion.
Take a look at the CUDA Thrust Library's reduce function. This is demonstrated to have 95+% efficiency compared with heavily hand-tuned kernels and is pretty flexible and easy to use.
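For example, if the goal is ultimately the single largest value in the 20K array and where it sits, Thrust can do the whole thing in one call. A minimal sketch, assuming the data is already on the device in a thrust::device_vector (the fill step is left as a placeholder); thrust::max_element returns an iterator, so the index falls out for free.

#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <cstdio>

int main()
{
    const int N = 20000;                       // 20K values, as in the question
    thrust::device_vector<float> d_in(N);
    // ... fill d_in, e.g. by copying from a host vector ...

    // max_element runs the reduction on the device and returns an iterator,
    // so both the maximum and its position come out of one call.
    thrust::device_vector<float>::iterator it =
        thrust::max_element(d_in.begin(), d_in.end());

    int   max_index = (int)(it - d_in.begin());
    float max_val   = *it;                     // single device-to-host copy

    printf("max = %f at index %d\n", max_val, max_index);
    return 0;
}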
Check my kernel. You can put your block results into an array (which can be in global memory) and get the result in global memory. And see how I call it in host code:
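The referenced kernel and host call are not reproduced above. Purely as an illustration of that pattern, with the per-block results living in a global-memory array and the 50 block winners finished on the CPU, here is a hypothetical host-side sketch built around the question's own getmax kernel, launch parameters, and dummy data; it is not the poster's actual code.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void getmax(float *in1, float *out1, int *index);   // the question's kernel

int main()
{
    const int N = 20000, num_blocks = 50, block_size = 400;

    float *h_in = new float[N];
    for (int i = 0; i < N; ++i) h_in[i] = (float)((i * 37) % 1000);   // dummy data

    float *d_in, *d_out1;
    int   *d_indices;
    cudaMalloc(&d_in,      N * sizeof(float));
    cudaMalloc(&d_out1,    num_blocks * sizeof(float));   // block results in global memory
    cudaMalloc(&d_indices, num_blocks * sizeof(int));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    getmax<<<num_blocks, block_size>>>(d_in, d_out1, d_indices);

    float h_out1[num_blocks];
    int   h_indices[num_blocks];
    cudaMemcpy(h_out1,    d_out1,    num_blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_indices, d_indices, num_blocks * sizeof(int),   cudaMemcpyDeviceToHost);

    // Pick the overall winner from the 50 per-block maxima.
    int best = 0;
    for (int b = 1; b < num_blocks; ++b)
        if (h_out1[b] > h_out1[best]) best = b;
    printf("max = %f at index %d\n", h_out1[best], h_indices[best]);

    cudaFree(d_in); cudaFree(d_out1); cudaFree(d_indices);
    delete[] h_in;
    return 0;
}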