CUDA - Multiple kernels to compute a single value



Hey, I'm trying to write a kernel to essentially do the following in C

 float sum = 0.0;
 for(int i = 0; i < N; i++){
   sum += valueArray[i]*valueArray[i];
 }
 sum += sum / N;

At the moment I have this inside my kernel, but it is not giving correct values.

int i0 = blockIdx.x * blockDim.x + threadIdx.x;

for(int i = i0; i < N; i += blockDim.x * gridDim.x){
    *d_sum += d_valueArray[i] * d_valueArray[i];
}

*d_sum = __fdividef(*d_sum, N);

The code used to call the kernel is

kernelName<<<64,128>>>(N, d_valueArray, d_sum);
cudaMemcpy(&sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);
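
One other thing worth noting about this host snippet: d_sum is never zeroed on the device before the launch, so the += in the kernel accumulates on top of whatever the allocation happened to contain. A minimal host-side setup, assuming d_sum was allocated with cudaMalloc and with error checking omitted, might look like this:

float  sum = 0.0f;
float *d_sum = NULL;
cudaMalloc((void**)&d_sum, sizeof(float));        // single-float accumulator on the device
cudaMemset(d_sum, 0, sizeof(float));              // all-zero bytes give 0.0f, so += starts from zero

kernelName<<<64,128>>>(N, d_valueArray, d_sum);   // launch as in the question
cudaMemcpy(&sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);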

I think that each kernel is calculating a partial sum, but the final divide statement is not taking into account the accumulated value from each of the threads. Every kernel is producing its own final value for d_sum?

Does anyone know how I could go about doing this in an efficient way? Maybe using shared memory between threads? I'm very new to GPU programming. Cheers
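
For reference, the *d_sum += ... line above is an unsynchronized read-modify-write performed by every thread in the grid, so most of the partial results get lost. A minimal (though slow) way to make just the accumulation correct, assuming a GPU that supports atomicAdd on single-precision floats, would be a hypothetical kernel along these lines, with the division by N left to the host so it happens exactly once:

__global__ void sumSquaresAtomic(int N, const float *d_valueArray, float *d_sum)
{
    int i0 = blockIdx.x * blockDim.x + threadIdx.x;

    // Grid-stride loop, as in the original kernel, but the accumulation is atomic.
    for(int i = i0; i < N; i += blockDim.x * gridDim.x){
        atomicAdd(d_sum, d_valueArray[i] * d_valueArray[i]);  // serializes on d_sum: correct but slow
    }
    // No division here: every thread would repeat it; divide by N on the host instead.
}

The answer below shows the more efficient approach: reduce within each block first, so only one value per block ever has to be combined globally.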


Comments (1)

ㄟ。诗瑗 2024-10-28 21:53:27


You're updating d_sum from multiple threads.

See the following SDK sample:

http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html

Here's the code from that sample. Note that it's a two-step process: each thread block first reduces its own portion (synchronizing with __syncthreads) before the final result is accumulated. (IMUL is an integer-multiply macro defined elsewhere in the sample.)

#define ACCUM_N 1024
__global__ void scalarProdGPU(
    float *d_C,
    float *d_A,
    float *d_B,
    int vectorN,
    int elementN
){
    //Accumulators cache
    __shared__ float accumResult[ACCUM_N];

    ////////////////////////////////////////////////////////////////////////////
    // Cycle through every pair of vectors,
    // taking into account that vector counts can be different
    // from total number of thread blocks
    ////////////////////////////////////////////////////////////////////////////
    for(int vec = blockIdx.x; vec < vectorN; vec += gridDim.x){
        int vectorBase = IMUL(elementN, vec);
        int vectorEnd  = vectorBase + elementN;

        ////////////////////////////////////////////////////////////////////////
        // Each accumulator cycles through vectors with a
        // stride equal to the total number of accumulators, ACCUM_N.
        // At this stage ACCUM_N is only preferred to be a multiple of warp size
        // to meet memory coalescing alignment constraints.
        ////////////////////////////////////////////////////////////////////////
        for(int iAccum = threadIdx.x; iAccum < ACCUM_N; iAccum += blockDim.x){
            float sum = 0;

            for(int pos = vectorBase + iAccum; pos < vectorEnd; pos += ACCUM_N)
                sum += d_A[pos] * d_B[pos];

            accumResult[iAccum] = sum;
        }

        ////////////////////////////////////////////////////////////////////////
        // Perform tree-like reduction of accumulators' results.
        // ACCUM_N has to be power of two at this stage
        ////////////////////////////////////////////////////////////////////////
        for(int stride = ACCUM_N / 2; stride > 0; stride >>= 1){
            __syncthreads();
            for(int iAccum = threadIdx.x; iAccum < stride; iAccum += blockDim.x)
                accumResult[iAccum] += accumResult[stride + iAccum];
        }

        if(threadIdx.x == 0) d_C[vec] = accumResult[0];
    }
}
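
To map this pattern back onto the original question: one common arrangement is for each block to reduce its share of the squared values in shared memory, write a single partial sum per block to global memory, and have the host (or a small second kernel) combine the partials and apply the final division step. A rough sketch along those lines, using hypothetical names and a block size of 128 to match the <<<64,128>>> launch in the question:

#define BLOCK_SIZE 128   // must equal the block size used at launch, and be a power of two

__global__ void sumSquaresPartial(int N, const float *d_valueArray, float *d_partial)
{
    __shared__ float cache[BLOCK_SIZE];

    // Each thread accumulates a private sum of squares over a grid-stride loop.
    float sum = 0.0f;
    for(int i = blockIdx.x * blockDim.x + threadIdx.x; i < N; i += blockDim.x * gridDim.x)
        sum += d_valueArray[i] * d_valueArray[i];

    cache[threadIdx.x] = sum;
    __syncthreads();

    // Tree-like reduction in shared memory, as in the SDK sample above.
    for(int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1){
        if(threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }

    // One result per block; with <<<64,128>>> that is 64 partial sums in d_partial.
    if(threadIdx.x == 0)
        d_partial[blockIdx.x] = cache[0];
}

The host then copies the 64 partial sums back, adds them up, and applies the final division step from the reference C code exactly once, instead of having every thread in the grid execute it as in the original kernel.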