Optimizing CUDA code performance when odd/even threads perform different operations
I have two large vectors and I am trying to do a pairwise element multiplication: each even-numbered element of the first vector is multiplied by the next odd-numbered element of the second vector, and each odd-numbered element of the first vector is multiplied by the preceding even-numbered element of the second vector.
For example (elements are labeled from 1 here, while tx in the code below counts from 0):
Vector 1 is V1(1) V1(2) V1(3) V1(4)
Vector 2 is V2(1) V2(2) V2(3) V2(4)
V1(1) * V2(2)
V1(3) * V2(4)
V1(2) * V2(1)
V1(4) * V2(3)
I have written CUDA code to do this (Pds holds the elements of the first vector in shared memory, Nds those of the second vector):
    // instead of % 2, checking the first bit to decide if a number
    // is odd/even is faster
    if ((tx & 0x0001) == 0x0000)
        Nds[tx+1] = Pds[tx] * Nds[tx+1];
    else
        Nds[tx-1] = Pds[tx] * Nds[tx-1];
    __syncthreads();
Is there any way to further accelerate this code or avoid divergence?
Comments (2)
You should be able to eliminate the branch like this:
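The code that followed this sentence is not present in this copy of the thread. As a minimal sketch of one common way to remove the branch, and not necessarily what the answerer posted, the partner index can be computed with XOR: tx ^ 1 flips the lowest bit, so even tx maps to tx + 1 and odd tx maps to tx - 1. The kernel name pairSwapMultiply, the host helper runPairSwapMultiply, the TILE block size, and the separate out array are all assumptions made for illustration; Pds and Nds follow the question. CUDA error checking is left out for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 256;  // assumed block size; must be even so pairs never straddle blocks

// Branch-free form of the question's update: every thread runs the same
// statement, only the partner index differs.
__global__ void pairSwapMultiply(const float* v1, const float* v2, float* out, int n)
{
    __shared__ float Pds[TILE];
    __shared__ float Nds[TILE];

    int tx = threadIdx.x;
    int i  = blockIdx.x * TILE + tx;

    // Pad past the end with 1.0f so an unpaired last element is left unchanged.
    Pds[tx] = (i < n) ? v1[i] : 1.0f;
    Nds[tx] = (i < n) ? v2[i] : 0.0f;
    __syncthreads();

    int partner = tx ^ 1;                    // even -> tx + 1, odd -> tx - 1
    Nds[partner] = Pds[tx] * Nds[partner];   // each slot is touched by exactly one thread
    __syncthreads();

    if (i < n) out[i] = Nds[tx];
}

// Hypothetical host helper: copies the inputs to the device, runs the kernel,
// and returns the updated second vector.
std::vector<float> runPairSwapMultiply(const std::vector<float>& v1,
                                       const std::vector<float>& v2)
{
    assert(v1.size() == v2.size());
    int n = static_cast<int>(v1.size());
    std::vector<float> out(n);
    if (n == 0) return out;

    size_t bytes = n * sizeof(float);
    float *d1 = nullptr, *d2 = nullptr, *dout = nullptr;
    cudaMalloc(&d1, bytes);
    cudaMalloc(&d2, bytes);
    cudaMalloc(&dout, bytes);
    cudaMemcpy(d1, v1.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d2, v2.data(), bytes, cudaMemcpyHostToDevice);

    int blocks = (n + TILE - 1) / TILE;
    pairSwapMultiply<<<blocks, TILE>>>(d1, d2, dout, n);
    cudaMemcpy(out.data(), dout, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d1);
    cudaFree(d2);
    cudaFree(dout);
    return out;
}
```

In the question's own snippet, the equivalent one-line change would be Nds[tx ^ 1] = Pds[tx] * Nds[tx ^ 1];. Whether this is measurably faster is not something the thread establishes, since a compiler may already turn such a short if/else into predicated instructions.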
This is an old post, but maybe someone will find my answer useful. If tx in your code is threadIdx, then you have a branch that causes warp divergence. You should avoid divergence within a warp, because the hardware serializes the two paths: the threads that take one side of the branch run while the others are masked off, and then the threads that take the other side run. If tx is threadIdx, try to change your algorithm so that the branch depends on blockIdx instead.
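To make the blockIdx suggestion concrete, here is a rough sketch under several assumptions: the vectors live in global memory, even-numbered blocks handle the even elements, and odd-numbered blocks handle the odd elements. The names pairSwapByBlock, runPairSwapByBlock, and BLOCK are hypothetical. Every thread in a block takes the same side of the branch, so no warp diverges, but each thread now reads with a stride of two elements, which reduces memory coalescing. Whether that trade pays off would need to be measured.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK = 256;  // assumed block size

// The branch depends only on blockIdx.x, so it is uniform across each block.
// out must already hold a copy of v2 so an unpaired last element is preserved.
__global__ void pairSwapByBlock(const float* v1, const float* v2, float* out, int n)
{
    int parity = blockIdx.x & 1;                               // 0: even elements, 1: odd elements
    int pair   = (blockIdx.x >> 1) * blockDim.x + threadIdx.x; // which (even, odd) pair
    if (pair >= n / 2) return;                                 // tail guard

    int i = 2 * pair + parity;
    if (parity == 0)
        out[i + 1] = v1[i] * v2[i + 1];   // even element times the next odd element
    else
        out[i - 1] = v1[i] * v2[i - 1];   // odd element times the preceding even element
}

// Hypothetical host helper returning the updated second vector.
std::vector<float> runPairSwapByBlock(const std::vector<float>& v1,
                                      const std::vector<float>& v2)
{
    assert(v1.size() == v2.size());
    int n = static_cast<int>(v1.size());
    std::vector<float> out(v2);
    int pairs = n / 2;
    if (pairs == 0) return out;

    size_t bytes = n * sizeof(float);
    float *d1 = nullptr, *d2 = nullptr, *dout = nullptr;
    cudaMalloc(&d1, bytes);
    cudaMalloc(&d2, bytes);
    cudaMalloc(&dout, bytes);
    cudaMemcpy(d1, v1.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d2, v2.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dout, v2.data(), bytes, cudaMemcpyHostToDevice);

    int blocksPerParity = (pairs + BLOCK - 1) / BLOCK;
    pairSwapByBlock<<<2 * blocksPerParity, BLOCK>>>(d1, d2, dout, n);
    cudaMemcpy(out.data(), dout, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d1);
    cudaFree(d2);
    cudaFree(dout);
    return out;
}
```

For this particular problem the two sides of the branch differ only in the partner index, so computing that index arithmetically is likely the simpler fix. The block-level split is shown only to illustrate what a branch on blockIdx looks like.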