Bit array in CUDA

Posted on 2024-10-08 02:15:26

I am implementing the Sieve of Eratosthenes in CUDA and am getting very strange output. I am using an unsigned char* as the data structure, with the following macros to manipulate the bits.

#define ISBITSET(x,i) ((x[i>>3] & (1<<(i&7)))!=0)
#define SETBIT(x,i) x[i>>3]|=(1<<(i&7));
#define CLEARBIT(x,i) x[i>>3]&=(1<<(i&7))^0xFF;

A set bit denotes a prime number; otherwise the bit is 0.
Here is where I call my kernel:

size_t p=3;
size_t primeTill = 30;

while(p*p<=primeTill)
{
    if(ISBITSET(h_a, p) == 1){
        int dimA = 30;
        int numBlocks = 1;
        int numThreadsPerBlock = dimA;
        dim3 dimGrid(numBlocks);
        dim3 dimBlock(numThreadsPerBlock);
        cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );        
        cudaThreadSynchronize();    
        reverseArrayBlock<<< dimGrid, dimBlock >>>( d_a, primeTill, p );
        cudaThreadSynchronize();    
        cudaMemcpy( h_a, d_a, memSize, cudaMemcpyDeviceToHost );
        cudaThreadSynchronize();    
        printf("This is after removing multiples of %d\n", p);
        //Loop
        for(size_t i = 0; i < primeTill +1; i++)
        {
            printf("Bit %d is %d\n", i, ISBITSET(h_a, i));
        }
    }           
    p++;
}

Here is my kernel:

__global__ void reverseArrayBlock(unsigned char *d_out, int size, size_t p)
{
int id = blockIdx.x*blockDim.x + threadIdx.x;
int r = id*p;
if(id >= p && r <= size )
{
    while(ISBITSET(d_out, r ) == 1 ){
        CLEARBIT(d_out, r);
    }

    // if(r == 9)
    // {
    //  /* code */
    //  CLEARBIT(d_out, 9);
    // }

}

}

The output should be:
2, 3, 5, 7, 11, 13, 17, 19, 23, 29
while my output is:
2, 3, 5, 9, 7, 11, 13, 17, 19, 23, 29

If I uncomment the if(r == 9) block in the kernel, I get the correct answer, which suggests that there is nothing wrong with my loop or my bounds check.

Comments (2)

溺深海 2024-10-15 02:15:26

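Multiple threads access the same byte (char) in global memory at the same time, so the written result gets corrupted. CLEARBIT is a non-atomic read-modify-write of a whole byte: for p = 3, the threads clearing bits 9, 12 and 15 all load, modify and store byte 1, and whichever thread stores last can write back a stale value in which another thread's bit is still set.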

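You could use atomic operations to prevent this (note that CUDA's atomicAnd works on 32-bit and 64-bit words, so the array would have to be addressed as unsigned int rather than unsigned char), but the better solution is to change the algorithm: instead of letting every thread sieve out one multiple of the current prime, let every thread handle a range such as [0..7], [8..15], ..., so that every range's length is a multiple of 8 bits, each byte is written by exactly one thread, and no collisions occur.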
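
Both suggestions can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the original poster's code: the names testBit, clearMultiplesAtomic, clearMultiplesByWord and sievePrimes are hypothetical, error checking is omitted, and the bit array is stored as unsigned int words because atomicAnd has no 8-bit overload. The second kernel gives each thread one 32-bit word (a range whose length is a multiple of 8 bits), so it needs no atomics.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: bit i lives in word i / 32, at position i % 32.
__host__ __device__ inline bool testBit(const unsigned int *bits, unsigned int i) {
    return ((bits[i >> 5] >> (i & 31u)) & 1u) != 0;
}

// Variant 1: one thread per multiple of p, as in the question, but the
// read-modify-write is done atomically on the containing 32-bit word.
__global__ void clearMultiplesAtomic(unsigned int *bits, unsigned int limit,
                                     unsigned int p) {
    unsigned int id = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned long long r = (unsigned long long)id * p;  // 64-bit to avoid overflow
    if (id >= p && r <= limit) {
        atomicAnd(&bits[r >> 5], ~(1u << (r & 31u)));
    }
}

// Variant 2: one thread per 32-bit word. Thread w owns the numbers
// [32*w, 32*w + 31] and is the only writer of bits[w], so no atomics are needed.
__global__ void clearMultiplesByWord(unsigned int *bits, unsigned int numWords,
                                     unsigned int limit, unsigned int p) {
    unsigned int w = blockIdx.x * blockDim.x + threadIdx.x;
    if (w >= numWords) return;
    unsigned long long lo = 32ull * w;
    unsigned long long hi = lo + 31;
    if (hi > limit) hi = limit;
    unsigned long long first = (unsigned long long)p * p;
    if (first < lo) first = ((lo + p - 1) / p) * p;  // first multiple of p >= lo
    unsigned int mask = 0;
    for (unsigned long long r = first; r <= hi; r += p) {
        mask |= 1u << (r - lo);
    }
    bits[w] &= ~mask;
}

// Host driver (assumption: error checking omitted for brevity).
std::vector<unsigned int> sievePrimes(unsigned int limit, bool useAtomics) {
    const unsigned int numWords = limit / 32 + 1;
    const size_t bytes = numWords * sizeof(unsigned int);
    std::vector<unsigned int> h(numWords, 0xFFFFFFFFu);
    h[0] &= ~3u;  // 0 and 1 are not prime

    unsigned int *d = nullptr;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);

    const unsigned int threads = 128;
    // Launching for every p (prime or not) avoids a host round trip per step;
    // clearing the multiples of a composite p is redundant but harmless.
    for (unsigned int p = 2; (unsigned long long)p * p <= limit; ++p) {
        if (useAtomics) {
            unsigned int n = limit / p + 1;  // ids 0 .. limit / p
            clearMultiplesAtomic<<<(n + threads - 1) / threads, threads>>>(d, limit, p);
        } else {
            clearMultiplesByWord<<<(numWords + threads - 1) / threads, threads>>>(
                d, numWords, limit, p);
        }
    }
    // Kernels on the default stream run in order; this copy waits for them.
    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);

    std::vector<unsigned int> primes;
    for (unsigned int i = 2; i <= limit; ++i) {
        if (testBit(h.data(), i)) primes.push_back(i);
    }
    return primes;
}
```

Calling sievePrimes(30, true) or sievePrimes(30, false) should yield the list the question expects. The per-word variant also does one store per word instead of one atomic per multiple, which is part of why the range-based layout is usually preferable.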

只怪假的太真实 2024-10-15 02:15:26

I would suggest replacing the macros with functions to start with. You can qualify functions with __host__ and __device__ to generate host (cpp) and device (cu) versions where necessary. That will eliminate the possibility of the preprocessor doing something unexpected, for example because of the trailing semicolons in SETBIT and CLEARBIT or the unparenthesized macro arguments.
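
As a rough sketch of this suggestion, the three macros could be written as inline functions along these lines. The names isBitSet, setBit and clearBit are hypothetical and do not come from the original post. These functions are still plain, non-atomic read-modify-writes, so on their own they only rule out preprocessor surprises and do not remove the race described in the other answer.

```cuda
#include <cstddef>

// Hypothetical replacements for ISBITSET / SETBIT / CLEARBIT.
// Bit i lives in byte i / 8, at position i % 8, as in the original macros.
__host__ __device__ inline bool isBitSet(const unsigned char *x, size_t i) {
    return (x[i >> 3] & (1u << (i & 7))) != 0;
}

__host__ __device__ inline void setBit(unsigned char *x, size_t i) {
    x[i >> 3] |= static_cast<unsigned char>(1u << (i & 7));
}

__host__ __device__ inline void clearBit(unsigned char *x, size_t i) {
    x[i >> 3] &= static_cast<unsigned char>(~(1u << (i & 7)));
}
```

Call sites would then read isBitSet(h_a, p) on the host and clearBit(d_out, r) inside the kernel, with argument types checked by the compiler.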

Then debug the particular code branch that is causing the wrong output, checking that each stage is correct in turn, and you'll find the problem.
