Error: BFS on CUDA with synchronization
The following code has an error: when it runs, some of the graph costs get overwritten, but that should not be happening given the Xa array (which keeps track of which nodes have already been visited) and the __syncthreads() call... Can someone help?
#include <cstdio>
#include <cuda_runtime.h>
#include <cutil.h> // for CUT_CHECK_ERROR (old CUDA SDK utility header)

struct Node
{
    int begin; // start of this node's edge sub-array in Ea
    int num;   // number of edges (size of the sub-array)
};

__global__ void BFS (Node *Va, int *Ea, bool *Fa, bool *Xa, int *Ca, bool *parada)
{
    int tid = threadIdx.x;
    if (Fa[tid] == true && Xa[tid] == false)
    {
        Fa[tid] = false;
        __syncthreads();
        // Va[tid].begin is where this node's edge sub-array starts;
        // Va[tid].num is its number of elements
        for (int i = Va[tid].begin; i < (Va[tid].begin + Va[tid].num); i++)
        {
            int nid = Ea[i];
            if (Xa[nid] == false)
            {
                Ca[nid] = Ca[tid] + 1;
                Fa[nid] = true;
                *parada = true;
            }
        }
        Xa[tid] = true;
    }
}
// The BFS frontier corresponds to all the nodes being processed
// at the current level.
int main()
{
    // graph description
    struct Node node[4];
    node[0].begin=0;
    node[0].num=2;
    node[1].begin=1;
    node[1].num=0;
    node[2].begin=2;
    node[2].num=2;
    node[3].begin=1;
    node[3].num=0;
    int edges[]={1,2,3,1};
    bool frontier[4]={false};
    bool visited[4]={false};
    int custo[4]={0};
    int source=0;
    frontier[source]=true;

    Node* Va;
    cudaMalloc((void**)&Va,sizeof(Node)*4);
    cudaMemcpy(Va,node,sizeof(Node)*4,cudaMemcpyHostToDevice);
    int* Ea;
    cudaMalloc((void**)&Ea,sizeof(int)*4);
    cudaMemcpy(Ea,edges,sizeof(int)*4,cudaMemcpyHostToDevice);
    bool* Fa;
    cudaMalloc((void**)&Fa,sizeof(bool)*4);
    cudaMemcpy(Fa,frontier,sizeof(bool)*4,cudaMemcpyHostToDevice);
    bool* Xa;
    cudaMalloc((void**)&Xa,sizeof(bool)*4);
    cudaMemcpy(Xa,visited,sizeof(bool)*4,cudaMemcpyHostToDevice);
    int* Ca;
    cudaMalloc((void**)&Ca,sizeof(int)*4);
    cudaMemcpy(Ca,custo,sizeof(int)*4,cudaMemcpyHostToDevice);

    dim3 threads(4,1,1);
    bool para;
    bool* parada;
    cudaMalloc((void**)&parada,sizeof(bool));
    printf("\n");
    int n=1;
    do{
        para=false;
        cudaMemcpy(parada,&para,sizeof(bool),cudaMemcpyHostToDevice);
        BFS <<<1,threads>>>(Va,Ea,Fa,Xa,Ca,parada);
        CUT_CHECK_ERROR("kernel1 execution failed");
        cudaMemcpy(&para,parada,sizeof(bool),cudaMemcpyDeviceToHost);
        printf("Run number: %d >> ",n);
        cudaMemcpy(custo,Ca,sizeof(int)*4,cudaMemcpyDeviceToHost);
        for(int i=0;i<4;i++)
            printf("%d ",custo[i]);
        printf("\n");
        n++;
    }while(para);
    printf("\nFinal:\n");
    cudaMemcpy(custo,Ca,sizeof(int)*4,cudaMemcpyDeviceToHost);
    for(int i=0;i<4;i++)
        printf("%d ",custo[i]);
    printf("\n");
}
There are a number of pretty major flaws in that device code. Firstly, you have memory races on both Xa and Ca. Secondly, you have a conditionally executed __syncthreads() call, which is illegal and can lead to the kernel hanging if it is executed by a warp of threads where any branch divergence around the call can occur. The structure of the algorithm you are using probably isn't going to be correct on CUDA, even if you were to use atomic memory access functions to eliminate the worst of the read-after-write races in the code as posted. Using atomic memory access would effectively serialise the code and cost a great deal of performance.
Breadth first search on CUDA isn't an unsolved problem. There are a number of good papers on implementations, if you care to search for them. I would recommend High Performance and Scalable GPU Graph Traversal, if you have not already seen it. The code for those authors' implementation is also available for download from here.