Is there a way to block some blocks until certain conditions are met?

Posted 2024-11-30 11:17:09


I want to block some blocks until a variable is set to a particular value, so I wrote this code to test whether a simple do-while loop would work.

__device__ int tag = 0;
__global__ void kernel() {
    if ( threadIdx.x == 0 ) {
        volatile int v;
        do {                  // spin until tag becomes non-zero
            v = tag;
        } while ( v == 0 );
    }
    __syncthreads();
    return;
}

However, it doesn't work (no infinite loop occurs, which is strange).

I would like to know whether there is any other way to block some blocks until some condition is satisfied, or whether some change to this code would make it work.


2 Answers

女中豪杰 2024-12-07 11:17:09


There currently is no reliable way to perform inter-block synchronization in CUDA.

There are hacky ways to achieve some manner of locking or blocking between blocks with a modest number of total threads, but they exploit undefined behaviour in the execution model which is not guaranteed to run the same way on all hardware or to continue working in the future. The only reliable way to ensure synchronization or blocking between blocks is to use separate kernel launches. If you can't make your algorithm work without inter-block synchronization, you either need a new algorithm, or your application is a very poor fit for the GPU architecture.
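
To make the separate-launch suggestion concrete, here is a minimal sketch (not from the original answer; the kernel names stage1/stage2 and the trivial workload are illustrative assumptions). Kernels issued to the same stream execute in order, so the second launch only begins after every block of the first has finished, which gives a grid-wide ordering point:

#include <cuda_runtime.h>

__global__ void stage1(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;          // every block does its share of the work
}

__global__ void stage2(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;         // guaranteed to see all of stage1's writes
}

int main()
{
    const int n = 1 << 20;
    int *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(int));

    // The boundary between the two launches is the grid-wide synchronization point:
    // no block of stage2 starts before every block of stage1 has completed.
    stage1<<<(n + 255) / 256, 256>>>(d_data, n);
    stage2<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}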

在梵高的星空下 2024-12-07 11:17:09


Here is a hackish way I tried to see if it will work.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>

__global__ static
void kernel(int *count, float *data)
{
    // Each thread index gets its own counter and its own slice of gridDim.x floats.
    count += threadIdx.x;
    data += gridDim.x * threadIdx.x;

    int i = blockIdx.x;
    if (i < gridDim.x - 1) {
        // All blocks except the last write their value and bump the counter.
        data[i] = i + 1;
        atomicAdd(count, 1);
        return;
    }

    // The last block spins until the counter shows the other blocks are done.
    while (atomicMin(count, i) != i);

    // Then it sums everything the other blocks produced.
    float tmp = i + 1;
    for (int j = 0; j < i; j++) tmp += data[j];

    data[i] = tmp;
}

int main(int argc, char **args)
{
    int num = 100;
    if (argc >= 2) num = atoi(args[1]);

    int bytes = num * sizeof(float) * 32;
    float *d_data; cudaMalloc((void **)&d_data, bytes);
    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < 32 * num; i++) h_data[i] = -1; // being safe

    int h_count[32] = {1};
    int *d_count; cudaMalloc((void **)&d_count, 32 * sizeof(int));
    cudaMemcpy(d_count, &h_count, 32 * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    kernel<<<num, 32>>>(d_count, d_data);
    cudaMemcpy(&h_count, d_count, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    for (int i = 0; i < 32; i++) {
        printf("sum of first %d from thread %d is %d \n", num, i, (int)h_data[num - 1]);
        h_data += num;
    }

    cudaFree(d_count);
    cudaFree(d_data);
    free(h_data - num * 32);
}

I cannot guarantee this will always work, but the breaking point on my card (320M) seems to be at num = 5796. Perhaps it is some kind of hardware limit that differs from card to card?

EDIT

The answer to this is that n * (n + 1) / 2 > 2^24 for n > 5795, and 2^24 (= 16,777,216) is the limit up to which single precision can represent every integer exactly; beyond that point not every integer value is exactly representable as a float, so the accumulated sum loses accuracy. Thanks to talonmies for pointing it out. (A tiny host-side demonstration of the 2^24 limit follows the output below.)

./a.out 5795
sum of first 5795 from thread 0 is 16793910 
sum of first 5795 from thread 1 is 16793910 
sum of first 5795 from thread 2 is 16793910 
sum of first 5795 from thread 3 is 16793910 
sum of first 5795 from thread 4 is 16793910 
sum of first 5795 from thread 5 is 16793910 
sum of first 5795 from thread 6 is 16793910 
sum of first 5795 from thread 7 is 16793910 
sum of first 5795 from thread 8 is 16793910 
sum of first 5795 from thread 9 is 16793910 
sum of first 5795 from thread 10 is 16793910 
sum of first 5795 from thread 11 is 16793910 
sum of first 5795 from thread 12 is 16793910 
sum of first 5795 from thread 13 is 16793910 
sum of first 5795 from thread 14 is 16793910 
sum of first 5795 from thread 15 is 16793910 
sum of first 5795 from thread 16 is 16793910 
sum of first 5795 from thread 17 is 16793910 
sum of first 5795 from thread 18 is 16793910 
sum of first 5795 from thread 19 is 16793910 
sum of first 5795 from thread 20 is 16793910 
sum of first 5795 from thread 21 is 16793910 
sum of first 5795 from thread 22 is 16793910 
sum of first 5795 from thread 23 is 16793910 
sum of first 5795 from thread 24 is 16793910 
sum of first 5795 from thread 25 is 16793910 
sum of first 5795 from thread 26 is 16793910 
sum of first 5795 from thread 27 is 16793910 
sum of first 5795 from thread 28 is 16793910 
sum of first 5795 from thread 29 is 16793910 
sum of first 5795 from thread 30 is 16793910 
sum of first 5795 from thread 31 is 16793910 
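
For reference, here is a minimal host-only snippet (not part of the original answer) that shows the 2^24 behaviour directly: above 16,777,216, single precision can no longer represent every integer, so adding 1.0f can be silently lost.

#include <stdio.h>

int main(void)
{
    float f = 16777216.0f;                       // 2^24
    printf("%.1f + 1 = %.1f\n", f, f + 1.0f);    // prints 16777216.0: the +1 is rounded away
    printf("%.1f + 2 = %.1f\n", f, f + 2.0f);    // prints 16777218.0: even integers are still exact
    return 0;
}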

--

I edited my former code, which used just one block. This is more representative of real-world threads/blocks (the memory accesses are weird and will be slow as hell, but they were done to quickly port my old test code to use multiple threads).

It looks like there are some cases where you can synchronize across blocks, but it mostly depends on knowing certain things beforehand (in this particular case, I was only syncing n - 1 blocks before performing an insanely useless count on the last block).

This is a proof of concept only; do not take the code seriously.
