Error: BFS on CUDA synchronization

Posted on 2024-12-10 16:21:38

My following code has a problem: when it runs, some of the graph weights are being overwritten, which should not be able to happen given the Xa array (which records which nodes have already been visited) and the __syncthreads() call... Can someone help?

struct Node 
{
    int begin;  // start of this node's edge sub-array in Ea
    int num;    // number of edges for this node
};

__global__ void BFS (Node *Va, int *Ea, bool *Fa, bool *Xa, int *Ca, bool *parada)
{
    int tid = threadIdx.x;

    if (Fa[tid] == true && Xa[tid] == false)
    {
        Fa[tid] = false; 
        __syncthreads();

        // Va[tid].begin is where this node's edge sub-array starts in Ea;
        // Va[tid].num is its number of elements
        for (int i = Va[tid].begin;  i < (Va[tid].begin + Va[tid].num); i++) 
        {           
            int nid = Ea[i];

            if (Xa[nid] == false)
            {
                Ca[nid] = Ca[tid] + 1;
                Fa[nid] = true;
                *parada = true;
            }   
        }    
        Xa[tid] = true;             
    }
}

// The BFS frontier corresponds to all the nodes being processed 
// at the current level.
int main()
{

    // graph description (CSR-style: node i's neighbours are edges[begin .. begin+num-1])
    struct Node node[4]; 
    node[0].begin=0; 
    node[0].num=2; 
    node[1].begin=1; 
    node[1].num=0; 
    node[2].begin=2; 
    node[2].num=2; 
    node[3].begin=1; 
    node[3].num=0; 
    int edges[]={1,2,3,1}; 

    bool frontier[4]={false}; 
    bool visited[4]={false}; 
    int custo[4]={0}; 

    int source=0; 
    frontier[source]=true; 

    Node* Va; 
    cudaMalloc((void**)&Va,sizeof(Node)*4); 
    cudaMemcpy(Va,node,sizeof(Node)*4,cudaMemcpyHostToDevice); 

    int* Ea; 
    cudaMalloc((void**)&Ea,sizeof(int)*4); 
    cudaMemcpy(Ea,edges,sizeof(int)*4,cudaMemcpyHostToDevice); 

    bool* Fa; 
    cudaMalloc((void**)&Fa,sizeof(bool)*4); 
    cudaMemcpy(Fa,frontier,sizeof(bool)*4,cudaMemcpyHostToDevice); 

    bool* Xa; 
    cudaMalloc((void**)&Xa,sizeof(bool)*4); 
    cudaMemcpy(Xa,visited,sizeof(bool)*4,cudaMemcpyHostToDevice); 

    int* Ca; 
    cudaMalloc((void**)&Ca,sizeof(int)*4); 
    cudaMemcpy(Ca,custo,sizeof(int)*4,cudaMemcpyHostToDevice); 

    dim3 threads(4,1,1); 

    bool para; 
    bool* parada; 
    cudaMalloc((void**)&parada,sizeof(bool)); 
    printf("\n");
    int n=1;
    do{ 
        para=false; 
        cudaMemcpy(parada,&para,sizeof(bool),cudaMemcpyHostToDevice);       
        BFS <<<1,threads>>>(Va,Ea,Fa,Xa,Ca,parada);     
        CUT_CHECK_ERROR("kernel1 execution failed"); 
        cudaMemcpy(&para,parada,sizeof(bool),cudaMemcpyDeviceToHost); 



        printf("Run number: %d >> ",n); 
        cudaMemcpy(custo,Ca,sizeof(int)*4,cudaMemcpyDeviceToHost);  
        for(int i=0;i<4;i++) 
            printf("%d  ",custo[i]); 
        printf("\n");
        n++;

    }while(para); 


    printf("\nFinal:\n");
    cudaMemcpy(custo,Ca,sizeof(int)*4,cudaMemcpyDeviceToHost); 

    for(int i=0;i<4;i++) 
        printf("%d  ",custo[i]); 
    printf("\n");

}

Comments (1)

夏末 2024-12-17 16:21:38

There are a number of pretty major flaws in that device code. Firstly, you have memory races on both Xa and Ca. Secondly, you have a conditionally executed __syncthreads() call, which is illegal and can cause the kernel to hang if it is executed by a warp of threads in which branch divergence occurs around the call.
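
For illustration only, here is a minimal sketch (not a complete or correct BFS, just the barrier pattern) of how the kernel could be restructured so that __syncthreads() is executed unconditionally by every thread in the block, with only the per-thread work left inside the divergent branch:

    // Sketch only: hoisting the barrier out of divergent control flow.
    // Every thread evaluates its predicate, the barrier is reached by the
    // whole block, and only the per-thread work stays inside the branch.
    __global__ void BFS(Node *Va, int *Ea, bool *Fa, bool *Xa, int *Ca, bool *parada)
    {
        int tid = threadIdx.x;

        bool active = (Fa[tid] && !Xa[tid]);   // predicate computed by every thread
        if (active)
            Fa[tid] = false;

        __syncthreads();                       // unconditional: legal for the whole block

        if (active)
        {
            for (int i = Va[tid].begin; i < Va[tid].begin + Va[tid].num; i++)
            {
                int nid = Ea[i];
                if (!Xa[nid])
                {
                    Ca[nid] = Ca[tid] + 1;     // NOTE: still racy; see the atomic sketch below
                    Fa[nid] = true;
                    *parada = true;
                }
            }
            Xa[tid] = true;
        }
    }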

The structure of the algorithm you are using probably isn't going to be correct on CUDA, even if you were to use atomic memory access functions to eliminate the worst of the read-after-write races in the code as posted. Using atomic memory access will effectively serialise the code and cost a great deal of performance.
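
To illustrate what such an atomic fix would look like, the racy assignment to Ca could be replaced by an atomic minimum inside the edge loop. This is only a sketch, and it relies on an extra assumption the posted code does not satisfy: the cost array would have to be initialised to a large sentinel (e.g. INT_MAX) for every node except the source, rather than to 0, so that the smallest competing write wins:

    // Sketch: atomic relaxation of the cost array inside the edge loop.
    // Assumes Ca[] is initialised to INT_MAX for all nodes except the source.
    for (int i = Va[tid].begin; i < Va[tid].begin + Va[tid].num; i++)
    {
        int nid = Ea[i];
        if (!Xa[nid])
        {
            atomicMin(&Ca[nid], Ca[tid] + 1);  // colliding writers are serialised
            Fa[nid] = true;
            *parada = true;
        }
    }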

Breadth first search on CUDA isn't an unsolved problem. There are a number of good papers on implementations, if you care to search for them. I would recommend High Performance and Scalable GPU Graph Traversal, if you have not already seen it. The code for those authors' implementation is also available for download from here.
