Bit array in CUDA
I am implementing the Sieve of Eratosthenes in CUDA and am getting very weird output. I am using unsigned char* as the data structure, with the following macros to manipulate the bits.
#define ISBITSET(x,i) ((x[i>>3] & (1<<(i&7)))!=0)
#define SETBIT(x,i) x[i>>3]|=(1<<(i&7));
#define CLEARBIT(x,i) x[i>>3]&=(1<<(i&7))^0xFF;
A set bit denotes a prime number; otherwise the bit is 0.
Here is where I call my kernel:
size_t p=3;
size_t primeTill = 30;
while(p*p<=primeTill)
{
    if(ISBITSET(h_a, p) == 1){
        int dimA = 30;
        int numBlocks = 1;
        int numThreadsPerBlock = dimA;
        dim3 dimGrid(numBlocks);
        dim3 dimBlock(numThreadsPerBlock);
        cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );
        cudaThreadSynchronize();
        reverseArrayBlock<<< dimGrid, dimBlock >>>( d_a, primeTill, p );
        cudaThreadSynchronize();
        cudaMemcpy( h_a, d_a, memSize, cudaMemcpyDeviceToHost );
        cudaThreadSynchronize();
        printf("This is after removing multiples of %d\n", p);
        //Loop
        for(size_t i = 0; i < primeTill +1; i++)
        {
            printf("Bit %d is %d\n", i, ISBITSET(h_a, i));
        }
    }
    p++;
}
Here is my kernel:
__global__ void reverseArrayBlock(unsigned char *d_out, int size, size_t p)
{
    int id = blockIdx.x*blockDim.x + threadIdx.x;
    int r = id*p;
    if(id >= p && r <= size )
    {
        while(ISBITSET(d_out, r ) == 1 ){
            CLEARBIT(d_out, r);
        }
        // if(r == 9)
        // {
        //     /* code */
        //     CLEARBIT(d_out, 9);
        // }
    }
}
The output should be:
2, 3, 5, 7, 11, 13, 17, 19, 23, 29
while my output is:
2, 3, 5, 9, 7, 11, 13, 17, 19, 23, 29
If you look at the kernel code, uncommenting those lines gives the correct answer, which suggests that there is nothing wrong with my loops or my checks!
Comments (2)
Multiple threads are accessing the same word (char) in global memory simultaneously, so the written result gets corrupted. CLEARBIT is a read-modify-write of a whole byte: for p = 3, the threads clearing bits 9, 12 and 15 all read and write byte 1, and one thread can write back a stale copy that undoes another thread's update.
You could use atomic operations to prevent this, but the better solution would be to alter your algorithm: instead of letting every thread sieve out multiples of 2, 3, 4, 5, ..., let every thread check a range like [0..7], [8..15], ... so that every range's length is a multiple of 8 bits and no collisions occur.
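The range-per-thread idea above can be sketched as follows. This is a minimal illustration, not the asker's code: the names `HD`, `clearMultiplesInByte`, `sieveByteKernel` and `sieveOnHost` are hypothetical, and the sketch assumes one thread per byte of the bit array. The `HD` guard lets the same helper compile under nvcc and as plain C++, so the logic can be checked on the host.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper macro: expands to CUDA qualifiers only under nvcc.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Clears the bits of every multiple of p (starting at p*p) that lies in
// byte byteIdx, i.e. in the range [8*byteIdx, 8*byteIdx + 7], capped at limit.
// Only bits[byteIdx] is written, so a thread that owns this byte cannot
// collide with a thread that owns a different byte.
HD inline void clearMultiplesInByte(unsigned char *bits, size_t byteIdx,
                                    size_t p, size_t limit)
{
    size_t lo = byteIdx * 8;
    size_t hi = lo + 7;
    if (hi > limit) hi = limit;
    size_t first = ((lo + p - 1) / p) * p;  // first multiple of p >= lo
    if (first < p * p) first = p * p;       // smaller multiples are already handled
    unsigned char mask = 0xFF;
    for (size_t m = first; m <= hi; m += p)
        mask &= (unsigned char)~(1u << (m & 7));
    bits[byteIdx] &= mask;                  // a single write to the owned byte
}

#ifdef __CUDACC__
// One thread per byte of the bit array (an assumption of this sketch).
__global__ void sieveByteKernel(unsigned char *bits, size_t numBytes,
                                size_t p, size_t limit)
{
    size_t byteIdx = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (byteIdx < numBytes)
        clearMultiplesInByte(bits, byteIdx, p, limit);
}
#endif

// Host-side reference that calls the same helper once per byte, standing in
// for the threads of the kernel. Returns the primes up to limit.
inline std::vector<size_t> sieveOnHost(size_t limit)
{
    size_t numBytes = limit / 8 + 1;
    std::vector<unsigned char> bits(numBytes, 0xFF);
    bits[0] &= 0xFC;                        // 0 and 1 are not prime
    for (size_t p = 2; p * p <= limit; ++p) {
        if (bits[p >> 3] & (1u << (p & 7))) {
            for (size_t b = 0; b < numBytes; ++b)
                clearMultiplesInByte(bits.data(), b, p, limit);
        }
    }
    std::vector<size_t> primes;
    for (size_t i = 2; i <= limit; ++i)
        if (bits[i >> 3] & (1u << (i & 7))) primes.push_back(i);
    return primes;
}
```

With a limit of 30 the host reference yields the list the question expects. If you prefer to keep one thread per multiple, note that CUDA's `atomicAnd` works on 32-bit and 64-bit words rather than on bytes, so the bit array would have to be stored as `unsigned int` for that route.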
I would suggest replacing the macros with functions to start with. You can use functions preceded by __host__ and __device__ to generate cpp and cu specific versions where necessary. That will eradicate the possibility of the pre-processor doing something unexpected.
Now just debug the particular code branch that is causing the wrong output, checking that each stage is correct in turn, and you'll find the problem.
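As a rough sketch of that suggestion, the three macros could become small inline functions. The names `HD`, `isBitSet`, `setBit` and `clearBit` are hypothetical, and the `HD` guard is only there so the same source builds with nvcc and with a plain C++ compiler. Unlike the macros, these have typed arguments and no trailing semicolon in their expansion, so they behave predictably inside `if`/`else` and with compound index expressions.

```cpp
#include <cstddef>

// Hypothetical helper macro: expands to CUDA qualifiers only under nvcc.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Returns true if bit i of the bit array is set.
HD inline bool isBitSet(const unsigned char *bits, size_t i)
{
    return (bits[i >> 3] & (1u << (i & 7))) != 0;
}

// Sets bit i of the bit array.
HD inline void setBit(unsigned char *bits, size_t i)
{
    bits[i >> 3] |= (unsigned char)(1u << (i & 7));
}

// Clears bit i of the bit array. This is still a plain read-modify-write,
// so it does not by itself make concurrent updates to one byte safe.
HD inline void clearBit(unsigned char *bits, size_t i)
{
    bits[i >> 3] &= (unsigned char)~(1u << (i & 7));
}
```

Swapping these in removes preprocessor surprises, but it does not change how threads share bytes of the array.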