用于有限元装配的 CUDA 内核

发布于 2024-12-18 08:39:12 字数 1234 浏览 4 评论 0原文

我们有一个非结构化四面体网格文件，包含以下格式：

element-ID  nod1 nod2 nod3 nod4


    1            452  3434  322 9000

    2            2322   837 6673 2323

    .
    .
    .

300000

我们对上述网格进行分区，每个分区大小为 2048。对于每个大小为 2048 的分区包含唯一的 nod1 nod2 nod3 nod4 值，我们在不同的起始索引处传递 1 个块和 512 个线程。

在 cuda 文件中，我们

__global__ void calc(double d_ax,int *nod1,int *node2,int *nod3,int *nod4,int   start,int size)
{
    int n1,n2,n3,n4;     
    int i = blockIdx.x * blockDim.x + threadIdx.x + start;


    if ( i < size )
    {

        n1=nod1[i];
        n2=nod2[i];
        n3=nod3[i];
        n4=nod4[i];

        ax[n1] += some code;
        ax[n2] += some code;
        ax[n3] += some code;
        ax[n4] += some code;
    }
}

调用内核，因为

calc<<<1,512>>>(d_ax,....,0,512);
calc<<<1,512>>>(d_ax,....,512,512);
calc<<<1,512>>>(d_ax,....,1024,512); 
calc<<<1,512>>>(d_ax,....1536,512);

上面的代码运行良好，但问题是我们一次使用多个块得到不同的结果。例如：

calc<<<2,512>>>(d_ax,....,0,1024); 
calc<<<2,512>>>(d_ax,....,1024,1024);

有人可以帮助我吗？

原文

We have an unstructured tetrahedral mesh file containing following format:

element-ID  nod1 nod2 nod3 nod4


    1            452  3434  322 9000

    2            2322   837 6673 2323

    .
    .
    .

300000

We partitioned the above mesh for partition size of 2048 each.
For each partition size of 2048 contains unique nod1 nod2 nod3 nod4 values, we pass 1 block and 512 threads at different start index.

In a cuda file, we have

__global__ void calc(double d_ax,int *nod1,int *node2,int *nod3,int *nod4,int   start,int size)
{
    int n1,n2,n3,n4;     
    int i = blockIdx.x * blockDim.x + threadIdx.x + start;


    if ( i < size )
    {

        n1=nod1[i];
        n2=nod2[i];
        n3=nod3[i];
        n4=nod4[i];

        ax[n1] += some code;
        ax[n2] += some code;
        ax[n3] += some code;
        ax[n4] += some code;
    }
}

We call the kernel as

calc<<<1,512>>>(d_ax,....,0,512);
calc<<<1,512>>>(d_ax,....,512,512);
calc<<<1,512>>>(d_ax,....,1024,512); 
calc<<<1,512>>>(d_ax,....1536,512);

the above code works well but the problem is we get different results using more than one block at a time. For example:

calc<<<2,512>>>(d_ax,....,0,1024); 
calc<<<2,512>>>(d_ax,....,1024,1024);

Can anyone help me?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

奈何桥上唱咆哮 2024-12-25 08:39:12

我不确定当您发布的代码不完整且无法编译时，您如何期望任何人告诉您可能出现的问题，但是如果在您的单块情况下您确实在调用内核，就像您发布的那样，这就是应该发生的情况：

calc<<<1,512>>>(d_ax,....,0,512);    // process first 512 elements
calc<<<1,512>>>(d_ax,....,512,512);  // start >= 512, size == 512, does nothing
calc<<<1,512>>>(d_ax,....,1024,512); // start >= 1024, size == 512, does nothing
calc<<<1,512>>>(d_ax,....1536,512);  // start >= 1536, size == 512, does nothing

因此，无论您的代码在使用多个块运行时是否可能被破坏，单块情况的结果都可能是错误的，因此您的问题的整个要点可能是无关紧要的。

如果您想要更好的答案，请编辑您的问题，使其包含问题的完整描述以及实际可以编译的简洁、完整的代码。否则，任何人都可以从您提供的信息中猜测到这一点。

I am not sure how you expect anyone to tell you what might be wrong when the code you have posted is incomplete and uncompilable, but if in your single block case you really are calling the kernel as you have posted, this is what should happen:

calc<<<1,512>>>(d_ax,....,0,512);    // process first 512 elements
calc<<<1,512>>>(d_ax,....,512,512);  // start >= 512, size == 512, does nothing
calc<<<1,512>>>(d_ax,....,1024,512); // start >= 1024, size == 512, does nothing
calc<<<1,512>>>(d_ax,....1536,512);  // start >= 1536, size == 512, does nothing

So irrespective of whether your code might be broken when run using multiple blocks, your results for the single block case are probably wrong, and the whole point of your question is probably irrelevant as a result.

If you want a better answer, edit your question so it contains a complete description of the problem and concise, complete code that could actually be compiled. Otherwise this is about as much as anybody could guess from the information you have provided.

回复收藏 0 原文