CUDA: finding the max value in a given array
I tried to develop a small CUDA program to find the max value in a given array:

int input_data[0...50] = 1,2,3,4,5....,50

max_value is initialized with the first element, input_data[0], and the final answer is stored in result[0]. The kernel gives 0 as the max value. I don't know what the problem is. I executed it with 1 block of 50 threads.
__device__ int lock=0;

__global__ void max(float *input_data, float *result)
{
    float max_value = input_data[0];
    int tid = threadIdx.x;

    if (input_data[tid] > max_value)
    {
        do {} while (atomicCAS(&lock, 0, 1));
        max_value = input_data[tid];
        __threadfence();
        lock = 0;
    }
    __syncthreads();
    result[0] = max_value; // Final result of max value
}
Even though there are built-in functions for this, I am just practicing on small problems.
4 Answers
You are trying to set up a "critical section", but this approach on CUDA can lead to a hang of your whole program - try to avoid it whenever possible.

Why does your code hang?

Your kernel (a __global__ function) is executed by groups of 32 threads, called warps. All threads inside a single warp execute synchronously. So the warp will spin in your do{} while(atomicCAS(&lock,0,1)) until all threads in the warp succeed in obtaining the lock. But obviously you want to prevent several threads from executing the critical section at the same time. This leads to a hang.

Alternative solution
What you need is a "parallel reduction algorithm". You can start with NVIDIA's reduction whitepaper: http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
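For illustration, here is a minimal sketch of such a reduction adapted to the question's setup (the kernel name maxReduce, the 64-thread launch, and the padding with -FLT_MAX are assumptions of mine, not part of the original answer):

#include <cfloat>

// Single-block max reduction. Launch with a power-of-two block size
// that covers the input, e.g. maxReduce<<<1, 64>>>(d_in, d_out, 50).
__global__ void maxReduce(const float *input_data, float *result, int n)
{
    __shared__ float sdata[64];                // one slot per thread
    int tid = threadIdx.x;

    // Each thread loads one element; out-of-range slots get -FLT_MAX
    // so they can never win a comparison.
    sdata[tid] = (tid < n) ? input_data[tid] : -FLT_MAX;
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }

    if (tid == 0)
        result[0] = sdata[0];                  // block-wide maximum
}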
Your code has a potential race. I'm not sure whether you defined the 'max_value' variable in shared memory or not, but either way it is wrong.

1) If 'max_value' is just a local variable, then each thread holds its own copy of it, which is not the actual maximum value (it is just the maximum of input_data[0] and input_data[tid]). In the last line of code, all threads write their own max_value to result[0], which results in undefined behavior.

2) If 'max_value' is a shared variable, 49 threads will enter the if-statement block, and they will try to update 'max_value' one at a time using the lock. But the order of execution among the 49 threads is not defined, so some threads may overwrite the actual maximum value with a smaller one. You would need to compare against the maximum value again inside the critical section.
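To make point 2 concrete, the re-check inside the critical section might look like the following sketch (it keeps the asker's lock scheme, so the previous answer's warning still applies: spinning on a lock among threads of one warp can hang on many GPUs):

__device__ int lock = 0;

__global__ void maxWithLock(const float *input_data, float *result)
{
    __shared__ float max_value;              // one copy per block
    int tid = threadIdx.x;

    if (tid == 0)
        max_value = input_data[0];           // initialize exactly once
    __syncthreads();

    if (input_data[tid] > max_value) {
        while (atomicCAS(&lock, 0, 1) != 0) {}   // acquire the lock
        if (input_data[tid] > max_value)         // compare again inside
            max_value = input_data[tid];         // the critical section
        __threadfence();
        atomicExch(&lock, 0);                    // release the lock
    }
    __syncthreads();

    if (tid == 0)
        result[0] = max_value;
}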
Max is a 'reduction' - check out the Reduction sample in the SDK, and do max instead of summation.
The white paper's a little old but still reasonably useful:
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
The final optimization step is to use 'warp synchronous' coding to avoid unnecessary __syncthreads() calls.
It requires at least 2 kernel invocations - one to write a bunch of intermediate max() values to global memory, then another to take the max() of that array.
If you want to do it in a single kernel invocation, check out the threadfenceReduction SDK sample. That uses __threadfence() and atomicAdd() to track progress, then has 1 block do a final reduction when all blocks have finished writing their intermediate results.
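To make the two-invocation idea concrete, here is a sketch that generalizes the single-block reduction above to many blocks (the kernel name blockMax and the launch parameters are my own assumptions):

#include <cfloat>

// Pass 1: each block reduces its slice and writes one partial max.
// Pass 2: run the same kernel again, with one block, on the partials.
__global__ void blockMax(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];         // blockDim.x floats
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? in[i] : -FLT_MAX; // pad out-of-range slots
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];          // this block's maximum
}

// Host side, two invocations (assuming numBlocks <= 256):
// blockMax<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_partial, n);
// blockMax<<<1, 256, 256 * sizeof(float)>>>(d_partial, d_result, numBlocks);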
Variables have different scopes. When you define a variable with __device__, it is placed in GPU global memory and is accessible by all threads in the grid; __shared__ places the variable in the block's shared memory, where it is accessible only by the threads of that block; and if you use no keyword at all, as in float max_value, the variable is placed in thread registers and can be accessed only within that thread. In your code, each thread has its own local max_value and cannot see the copies held by the other threads.
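A tiny sketch of the three scopes (the names are illustrative only):

__device__ float g_val;                  // global memory: one copy,
                                         // visible to every thread in the grid

__global__ void scopes(void)
{
    __shared__ float block_val;          // shared memory: one copy per block,
                                         // visible only to this block's threads
    float thread_val = g_val;            // register/local: one private copy per
                                         // thread, like max_value in the
                                         // question's kernel
    if (threadIdx.x == 0)
        block_val = thread_val;          // only one thread writes the shared copy
}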