CUDA kernel for finite element assembly
We have an unstructured tetrahedral mesh file in the following format:
element-ID nod1 nod2 nod3 nod4
1 452 3434 322 9000
2 2322 837 6673 2323
.
.
.
300000
We partitioned the above mesh into partitions of 2048 elements each, such that each partition contains unique nod1, nod2, nod3 and nod4 values.
For each partition, we launch the kernel with 1 block of 512 threads, at a different start index each time.
In a CUDA file, we have:
__global__ void calc(double d_ax,int *nod1,int *node2,int *nod3,int *nod4,int start,int size)
{
    int n1,n2,n3,n4;
    int i = blockIdx.x * blockDim.x + threadIdx.x + start;
    if ( i < size )
    {
        n1=nod1[i];
        n2=nod2[i];
        n3=nod3[i];
        n4=nod4[i];
        ax[n1] += some code;
        ax[n2] += some code;
        ax[n3] += some code;
        ax[n4] += some code;
    }
}
We call the kernel as:
calc<<<1,512>>>(d_ax,....,0,512);
calc<<<1,512>>>(d_ax,....,512,512);
calc<<<1,512>>>(d_ax,....,1024,512);
calc<<<1,512>>>(d_ax,....,1536,512);
The above code works well, but we get different results when we use more than one block per launch. For example:
calc<<<2,512>>>(d_ax,....,0,1024);
calc<<<2,512>>>(d_ax,....,1024,1024);
Can anyone help me?
Comments (2)
I am not sure how you expect anyone to tell you what might be wrong when the code you have posted is incomplete and uncompilable. But if, in your single-block case, you really are calling the kernel as you have posted, this is what should happen: only the first launch (start = 0) processes any elements. In the other three launches every thread computes an index i >= start >= size, so the test i < size fails and the kernel does nothing.
So irrespective of whether your code might be broken when run using multiple blocks, your results for the single-block case are probably wrong, and the whole point of your question is probably irrelevant as a result.
If you want a better answer, edit your question so that it contains a complete description of the problem and concise, complete code that could actually be compiled. Otherwise, this is about as much as anybody could guess from the information you have provided.
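To make the index arithmetic concrete, here is a minimal sketch that repeats the four single-block launches from the question and counts how many of the 2048 elements a thread would actually reach. The names mark_processed, count_processed and useEndBound are hypothetical and not part of the posted code, and the guard i < start + size is only one plausible correction, assuming size is meant as the number of elements per launch.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Marks every element index that some thread would process.
// useEndBound == false reproduces the posted guard (i < size);
// useEndBound == true uses the assumed correction (i < start + size).
__global__ void mark_processed(int *processed, int start, int size, bool useEndBound)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + start;
    int limit = useEndBound ? start + size : size;
    if (i < limit)
        processed[i] = 1;
}

// Hypothetical host helper: issues the launches
// <<<1,512>>>(..., 0, 512), (..., 512, 512), (..., 1024, 512), (..., 1536, 512)
// and returns the number of elements that were marked.
int count_processed(bool useEndBound)
{
    const int n = 2048;       // one partition, as in the question
    const int threads = 512;  // threads per block, as in the question

    int *d_processed = nullptr;
    cudaMalloc(&d_processed, n * sizeof(int));
    cudaMemset(d_processed, 0, n * sizeof(int));

    for (int start = 0; start < n; start += threads)
        mark_processed<<<1, threads>>>(d_processed, start, threads, useEndBound);

    // cudaMemcpy on the default stream waits for the kernels to finish.
    std::vector<int> h_processed(n);
    cudaMemcpy(h_processed.data(), d_processed, n * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_processed);

    int count = 0;
    for (int flag : h_processed)
        count += flag;
    return count;
}
```

Calling count_processed(false) should return 512, because only the launch with start = 0 passes the posted guard, while count_processed(true) should return 2048.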
But can the same node index appear in two partition sets?
If in two different blocks you have
Block 1:
ax[1234]=do something
Block 2:
ax[1234]=do something else
it smells like a race condition. You never know which of the two blocks will write first.
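If node indices really can be shared between elements that run concurrently, one common way to avoid lost updates is to accumulate with atomicAdd. The following is a sketch under stated assumptions, not the asker's code: assemble_atomic and run_assembly are hypothetical names, the contribution 1.0 stands in for the elided "some code", the tiny mesh is made up so that many elements share the same nodes, and the guard i < start + count is an assumed form of the bounds check. atomicAdd on double requires a device of compute capability 6.0 or higher (for example, compile with nvcc -arch=sm_60).

```cuda
#include <vector>
#include <cuda_runtime.h>

// One thread per element; each adds a contribution to its four nodes.
// atomicAdd makes each read-modify-write indivisible, so concurrent
// updates to a shared node are not lost.
__global__ void assemble_atomic(double *ax, const int *nod1, const int *nod2,
                                const int *nod3, const int *nod4,
                                int start, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x + start;
    if (i < start + count) {
        const double contribution = 1.0;  // placeholder for "some code"
        atomicAdd(&ax[nod1[i]], contribution);
        atomicAdd(&ax[nod2[i]], contribution);
        atomicAdd(&ax[nod3[i]], contribution);
        atomicAdd(&ax[nod4[i]], contribution);
    }
}

// Hypothetical host driver: 1024 elements over 8 nodes, launched as 2 blocks.
std::vector<double> run_assembly()
{
    const int numElements = 1024, numNodes = 8, threads = 512;

    // Node lists stored back to back: [nod1 | nod2 | nod3 | nod4].
    std::vector<int> nodes(4 * numElements);
    for (int e = 0; e < numElements; ++e) {
        nodes[0 * numElements + e] = 0;            // shared by every element
        nodes[1 * numElements + e] = 1;            // shared by every element
        nodes[2 * numElements + e] = 2 + (e % 2);  // shared by half of them
        nodes[3 * numElements + e] = 4 + (e % 4);  // shared by a quarter
    }

    int *d_nodes = nullptr;
    double *d_ax = nullptr;
    cudaMalloc(&d_nodes, nodes.size() * sizeof(int));
    cudaMalloc(&d_ax, numNodes * sizeof(double));
    cudaMemcpy(d_nodes, nodes.data(), nodes.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemset(d_ax, 0, numNodes * sizeof(double));  // all-zero bytes == 0.0

    assemble_atomic<<<numElements / threads, threads>>>(
        d_ax, d_nodes, d_nodes + numElements, d_nodes + 2 * numElements,
        d_nodes + 3 * numElements, 0, numElements);

    std::vector<double> ax(numNodes);
    cudaMemcpy(ax.data(), d_ax, numNodes * sizeof(double),
               cudaMemcpyDeviceToHost);
    cudaFree(d_nodes);
    cudaFree(d_ax);
    return ax;
}
```

With atomic accumulation, run_assembly() should return 1024 for nodes 0 and 1, 512 for nodes 2 and 3, and 256 for nodes 4 to 7, regardless of how the blocks are scheduled. With a plain += in the kernel, those totals could come out lower because some updates would be overwritten. An alternative that avoids atomics is to make sure that elements processed in the same launch never share a node, which appears to be what the partitioning in the question is aiming for.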