CUDA: combining thread-independent (??) variables during execution

Posted 2024-11-01 00:24:34

Guys, I apologize if the title is confusing. I thought long and hard and couldn't come up with a proper way to phrase the question in a single line, so here's more detail. I am doing a basic image subtraction where the second image has been modified, and I need to find the ratio of how much change was done to the image. For this I used the following code. Both images are 128x1024.

int num = 0, den = 0;   // hit and total counters
for(int i = 0; i < 128; i++)
{
    for(int j = 0; j < 1024; j++)
    {
        den++;
        diff[i * 1024 + j] = orig[i * 1024 + j] - modified[i * 1024 + j];
        if(diff[i * 1024 + j] < error)
        {
            num++;
        }
    }
}
ratio = (float)num / den;   // cast so integer division doesn't truncate the ratio

The above code works fine on the CPU, but I want to try to do this in CUDA. For this I can set up CUDA to do the basic subtraction of the images (code below), but I can't figure out how to do the conditional if statement to get my ratio out.

__global__ void calcRatio(float *orig, float *modified, int size, float *result)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if(index < size)
        result[index] = orig[index] - modified[index];
}

So, up to this point it works, but I cannot figure out how to parallelize the num and den counters in each thread to calculate the ratio at the end of all the thread executions. To me it feels like the num and den counters are independent of the threads, as every time I have tried to use them it seems they get incremented only once.

Any help will be appreciated, as I am just starting out in CUDA and none of the examples I see online seems to apply to what I need to do.

EDIT: Fixed my naive code. Forgot to type one of the main conditions in the code. It was a long, long day.

int num = 0, den = 0;   // hit and total counters
for(int i = 0; i < 128; i++)
{
    for(int j = 0; j < 1024; j++)
    {
        if(modified[i * 1024 + j] < 400.0)  // 400.0 threshold value to ignore noise
        {
            den++;
            diff[i * 1024 + j] = orig[i * 1024 + j] - modified[i * 1024 + j];
            if(diff[i * 1024 + j] < error)
            {
                num++;
            }
        }
    }
}
ratio = (float)num / den;   // cast so integer division doesn't truncate the ratio

Comments (2)

杀手六號 2024-11-08 00:24:34

The operation you need to use to perform global summation across all the threads is known as a "parallel reduction". While you could use atomic operations to do this, I would not recommend it. The CUDA SDK includes a reduction kernel and a very good paper discussing the technique; it is worth reading.

If I were writing code to do what you want, it would probably look like this:

template <int blocksize>
__global__ void calcRatio(float *orig, float *modified, int size, float *result,
                          int *count, const float error)
{
    // Per-block scratch space for the partial sums; volatile so the
    // warp-synchronous reduction below always reads fresh values.
    __shared__ volatile int buff[blocksize];

    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    // Each thread walks the input with a grid-sized stride, computing the
    // difference and keeping a private tally of elements below the error.
    int localCount = 0;
    for (int i = index; i < size; i += stride) {
        float val = orig[i] - modified[i];
        localCount += (val < error);
        result[i] = val;
    }

    buff[threadIdx.x] = localCount;
    __syncthreads();

    // Parallel reduction in shared memory using 1 warp
    if (threadIdx.x < warpSize) {

        // Fold all per-thread tallies down to warpSize partial sums
        for (int i = threadIdx.x + warpSize; i < blocksize; i += warpSize)
            buff[threadIdx.x] += buff[i];

        // Combine the 32 partial sums within the warp
        if (threadIdx.x < 16) buff[threadIdx.x] += buff[threadIdx.x + 16];
        if (threadIdx.x < 8)  buff[threadIdx.x] += buff[threadIdx.x + 8];
        if (threadIdx.x < 4)  buff[threadIdx.x] += buff[threadIdx.x + 4];
        if (threadIdx.x < 2)  buff[threadIdx.x] += buff[threadIdx.x + 2];
        if (threadIdx.x == 0) count[blockIdx.x] = buff[0] + buff[1];
    }
}

The first stanza does what your serial code does - computes a difference and a thread local total of elements which are less than error. Note I have written this version so that each thread is designed to process more than one entry of the input data. This has been done to help offset the computational cost of the parallel reduction that follows, and the idea is that you would use fewer blocks and threads than there were input data set entries.

The second stanza is the reduction itself, done in shared memory. It is effectively a "tree-like" operation where the set of thread-local subtotals within a single block of threads is first summed down to 32 subtotals, then those subtotals are combined until there is a single final subtotal, which is then stored as the total for the block. You will wind up with a small list of subtotals in count, one for each block you launched, which can be copied back to the host so the final result you need can be calculated there.

Please note I coded this in the browser and haven't compiled it, so there might be errors, but it should give you an idea of how an "advanced" version of what you are trying to do would work.
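
To make the last step concrete, here is a minimal host-side sketch (not part of the original answer; the block/grid sizes, the error threshold, and the variable names are assumptions) showing how the kernel might be launched and how the per-block subtotals could be combined into the ratio on the host:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const int size = 128 * 1024;      // 128x1024 image, flattened
    const int blocksize = 256;        // threads per block (assumed)
    const int nblocks = 64;           // fewer blocks than input entries, per the answer
    const float error = 1.0f;         // hypothetical error threshold

    float *d_orig, *d_modified, *d_result;
    int *d_count;
    cudaMalloc(&d_orig, size * sizeof(float));
    cudaMalloc(&d_modified, size * sizeof(float));
    cudaMalloc(&d_result, size * sizeof(float));
    cudaMalloc(&d_count, nblocks * sizeof(int));
    // ... fill d_orig and d_modified with cudaMemcpy from the host images ...

    calcRatio<blocksize><<<nblocks, blocksize>>>(d_orig, d_modified, size,
                                                 d_result, d_count, error);

    // Copy the per-block subtotals back and finish the sum on the host.
    int h_count[nblocks];
    cudaMemcpy(h_count, d_count, nblocks * sizeof(int), cudaMemcpyDeviceToHost);

    int num = 0;
    for (int b = 0; b < nblocks; b++)
        num += h_count[b];

    // In this simple version den is just the total number of pixels.
    float ratio = (float)num / size;
    printf("ratio = %f\n", ratio);

    cudaFree(d_orig); cudaFree(d_modified); cudaFree(d_result); cudaFree(d_count);
    return 0;
}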

旧夏天 2024-11-08 00:24:34

The denominator is pretty simple, since it's just the size.

The numerator is more troublesome, since its value for a given thread depends on all previous values. You're going to have to do that operation serially.

The thing you're looking for is probably atomicAdd. It's very slow, though.
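
As a rough illustration (my own sketch, not tested; the kernel and counter names are made up), an atomicAdd version could look something like this:

// Sketch: tally num and den with atomic increments on global counters.
// Every qualifying thread serializes on the same counters, which is why
// this approach tends to be slow compared to a proper reduction.
__global__ void calcRatioAtomic(const float *orig, const float *modified,
                                int size, float *result,
                                int *num, int *den, float error)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < size) {
        float val = orig[index] - modified[index];
        result[index] = val;
        atomicAdd(den, 1);        // every processed pixel counts toward den
        if (val < error)
            atomicAdd(num, 1);    // pixels whose difference is below the error
    }
}

Both counters would need to be zeroed before the launch (for example with cudaMemset), and then copied back to the host to form the ratio there.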

I think you'd find this question relevant. Your num is basically global data.
CUDA array-to-array sum

Alternatively, you could dump the results of the error check into an array. Counting the results could then be parallelized. It would be a little tricky, but I think something like this would scale up: http://tekpool.wordpress.com/2006/09/25/bit-count-parallel-counting-mit-hakmem/
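
A sketch of that idea (my own, using Thrust, which ships with the CUDA toolkit; the function and variable names are hypothetical): write a 0/1 flag per pixel, then let a library reduction count the flags in parallel.

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// Each thread writes 1 if its pixel difference is within the error, else 0.
__global__ void flagErrors(const float *orig, const float *modified,
                           int size, int *flags, float error)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < size)
        flags[index] = ((orig[index] - modified[index]) < error) ? 1 : 0;
}

// Host helper: launch the flag kernel, then sum the flags in parallel.
int countBelowError(const float *d_orig, const float *d_modified,
                    int size, float error)
{
    thrust::device_vector<int> flags(size);
    int threads = 256;
    int blocks = (size + threads - 1) / threads;
    flagErrors<<<blocks, threads>>>(d_orig, d_modified, size,
                                    thrust::raw_pointer_cast(flags.data()),
                                    error);
    return thrust::reduce(flags.begin(), flags.end());   // this is num
}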
