CUDA: combining thread-independent (??) variables during execution

Posted 2024-11-01 00:24:34

Guys, I apologize if the title is confusing. I thought long and hard and couldn't come up with a proper way to phrase the question in a single line, so here's more detail. I am doing a basic image subtraction where the second image has been modified, and I need to find the ratio of how much change was done to the image. For this I used the following code. Both images are 128x1024.

int num = 0, den = 0;   // hit and total counters
for(int i = 0; i < 128; i++)
{
    for(int j = 0; j < 1024; j++)
    {
        den++;
        diff[i * 1024 + j] = orig[i * 1024 + j] - modified[i * 1024 + j];
        if(diff[i * 1024 + j] < error)
        {
            num++;
        }
    }
}
ratio = (float)num / den;   // cast so integer division doesn't truncate the ratio

The above code works fine on the CPU, but I want to try to do this in CUDA. For this I can set up CUDA to do the basic subtraction of the images (code below), but I can't figure out how to do the conditional if statement to get my ratio out.

__global__ void calcRatio(float *orig, float *modified, int size, float *result)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if(index < size)
        result[index] = orig[index] - modified[index];
}

So, up to this point it works, but I cannot figure out how to parallelize the num and den counters in each thread to calculate the ratio at the end of all the thread executions. To me it feels like the num and den counters are independent of the threads, as every time I have tried to use them it seems they get incremented only once.

Any help will be appreciated, as I am just starting out in CUDA and none of the examples I see online seems to apply to what I need to do.

EDIT: Fixed my naive code. Forgot to type one of the main conditions in the code. It was a long, long day.

int num = 0, den = 0;   // hit and total counters
for(int i = 0; i < 128; i++)
{
    for(int j = 0; j < 1024; j++)
    {
        if(modified[i * 1024 + j] < 400.0)  // 400.0 threshold value to ignore noise
        {
            den++;
            diff[i * 1024 + j] = orig[i * 1024 + j] - modified[i * 1024 + j];
            if(diff[i * 1024 + j] < error)
            {
                num++;
            }
        }
    }
}
ratio = (float)num / den;   // cast so integer division doesn't truncate the ratio

Comments (2)

杀手六號 2024-11-08 00:24:34

The operation you need to use to perform global summation across all the threads is known as a "parallel reduction". While you could use atomic operations to do this, I would not recommend it. The CUDA SDK includes a reduction kernel and a very good paper discussing the technique; it is worth reading.

If I were writing code to do what you want, it would probably look like this:

template <int blocksize>
__global__ void calcRatio(float *orig, float *modified, int size, float *result,
                          int *count, const float error)
{
    // Per-block scratch space for the partial sums; volatile so the
    // warp-synchronous reduction below always reads fresh values.
    __shared__ volatile int buff[blocksize];

    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    // Each thread walks the input with a grid-sized stride, computing the
    // difference and keeping a private tally of elements below the error.
    int localCount = 0;
    for (int i = index; i < size; i += stride) {
        float val = orig[i] - modified[i];
        localCount += (val < error);
        result[i] = val;
    }

    buff[threadIdx.x] = localCount;
    __syncthreads();

    // Parallel reduction in shared memory using 1 warp
    if (threadIdx.x < warpSize) {

        // Fold all per-thread tallies down to warpSize partial sums
        for (int i = threadIdx.x + warpSize; i < blocksize; i += warpSize)
            buff[threadIdx.x] += buff[i];

        // Combine the 32 partial sums within the warp
        if (threadIdx.x < 16) buff[threadIdx.x] += buff[threadIdx.x + 16];
        if (threadIdx.x < 8)  buff[threadIdx.x] += buff[threadIdx.x + 8];
        if (threadIdx.x < 4)  buff[threadIdx.x] += buff[threadIdx.x + 4];
        if (threadIdx.x < 2)  buff[threadIdx.x] += buff[threadIdx.x + 2];
        if (threadIdx.x == 0) count[blockIdx.x] = buff[0] + buff[1];
    }
}

The first stanza does what your serial code does - computes a difference and a thread local total of elements which are less than error. Note I have written this version so that each thread is designed to process more than one entry of the input data. This has been done to help offset the computational cost of the parallel reduction that follows, and the idea is that you would use fewer blocks and threads than there were input data set entries.

The second stanza is the reduction itself, done in shared memory. It is effectively a "tree-like" operation where the set of thread-local subtotals within a single block of threads is first summed down to 32 subtotals, then those subtotals are combined until there is a single final subtotal, which is then stored as the total for the block. You will wind up with a small list of subtotals in count, one for each block you launched, which can be copied back to the host so the final result you need can be calculated there.

Please note I coded this in the browser and haven't compiled it, so there might be errors, but it should give you an idea of how an "advanced" version of what you are trying to do would work.
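
To make the last step concrete, here is a minimal host-side sketch (not part of the original answer; the block/grid sizes, the error threshold, and the variable names are assumptions) showing how the kernel might be launched and how the per-block subtotals could be combined into the ratio on the host:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const int size = 128 * 1024;      // 128x1024 image, flattened
    const int blocksize = 256;        // threads per block (assumed)
    const int nblocks = 64;           // fewer blocks than input entries, per the answer
    const float error = 1.0f;         // hypothetical error threshold

    float *d_orig, *d_modified, *d_result;
    int *d_count;
    cudaMalloc(&d_orig, size * sizeof(float));
    cudaMalloc(&d_modified, size * sizeof(float));
    cudaMalloc(&d_result, size * sizeof(float));
    cudaMalloc(&d_count, nblocks * sizeof(int));
    // ... fill d_orig and d_modified with cudaMemcpy from the host images ...

    calcRatio<blocksize><<<nblocks, blocksize>>>(d_orig, d_modified, size,
                                                 d_result, d_count, error);

    // Copy the per-block subtotals back and finish the sum on the host.
    int h_count[nblocks];
    cudaMemcpy(h_count, d_count, nblocks * sizeof(int), cudaMemcpyDeviceToHost);

    int num = 0;
    for (int b = 0; b < nblocks; b++)
        num += h_count[b];

    // In this simple version den is just the total number of pixels.
    float ratio = (float)num / size;
    printf("ratio = %f\n", ratio);

    cudaFree(d_orig); cudaFree(d_modified); cudaFree(d_result); cudaFree(d_count);
    return 0;
}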

旧夏天 2024-11-08 00:24:34

The denominator is pretty simple, since it's just the size.

The numerator is more troublesome, since its value for a given thread depends on all previous values. You're going to have to do that operation serially.

The thing you're looking for is probably atomicAdd. It's very slow, though.
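
As a rough illustration (my own sketch, not tested; the kernel and counter names are made up), an atomicAdd version could look something like this:

// Sketch: tally num and den with atomic increments on global counters.
// Every qualifying thread serializes on the same counters, which is why
// this approach tends to be slow compared to a proper reduction.
__global__ void calcRatioAtomic(const float *orig, const float *modified,
                                int size, float *result,
                                int *num, int *den, float error)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < size) {
        float val = orig[index] - modified[index];
        result[index] = val;
        atomicAdd(den, 1);        // every processed pixel counts toward den
        if (val < error)
            atomicAdd(num, 1);    // pixels whose difference is below the error
    }
}

Both counters would need to be zeroed before the launch (for example with cudaMemset), and then copied back to the host to form the ratio there.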

I think you'd find this question relevant. Your num is basically global data.
CUDA array-to-array sum

Alternatively, you could dump the results of the error check into an array. Counting the results could then be parallelized. It would be a little tricky, but I think something like this would scale up: http://tekpool.wordpress.com/2006/09/25/bit-count-parallel-counting-mit-hakmem/
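
A sketch of that idea (my own, using Thrust, which ships with the CUDA toolkit; the function and variable names are hypothetical): write a 0/1 flag per pixel, then let a library reduction count the flags in parallel.

#include <thrust/device_vector.h>
#include <thrust/reduce.h>

// Each thread writes 1 if its pixel difference is within the error, else 0.
__global__ void flagErrors(const float *orig, const float *modified,
                           int size, int *flags, float error)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < size)
        flags[index] = ((orig[index] - modified[index]) < error) ? 1 : 0;
}

// Host helper: launch the flag kernel, then sum the flags in parallel.
int countBelowError(const float *d_orig, const float *d_modified,
                    int size, float error)
{
    thrust::device_vector<int> flags(size);
    int threads = 256;
    int blocks = (size + threads - 1) / threads;
    flagErrors<<<blocks, threads>>>(d_orig, d_modified, size,
                                    thrust::raw_pointer_cast(flags.data()),
                                    error);
    return thrust::reduce(flags.begin(), flags.end());   // this is num
}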
