Optimizing code performance when odd/even threads perform different operations in CUDA

Posted on 2024-09-02 03:32:00

I have two large vectors and I am trying to do a kind of element-wise multiplication: each even-indexed element of the first vector (counting from 0, as in the code below) is multiplied by the next odd-indexed element of the second vector, and each odd-indexed element of the first vector is multiplied by the preceding even-indexed element of the second vector.

For example, with the elements labelled from 1:

vector 1 is V1(1) V1(2) V1(3) V1(4)
vector 2 is V2(1) V2(2) V2(3) V2(4)
V1(1) * V2(2)
V1(3) * V2(4)
V1(2) * V2(1)
V1(4) * V2(3)

I have written CUDA code to do this (Pds holds the elements of the first vector in shared memory, Nds those of the second vector):

// instead of % 2, checking the first bit to decide if a number
// is odd/even is faster 

if ((tx & 0x0001) == 0x0000)
    Nds[tx+1] = Pds[tx] * Nds[tx+1];
else
    Nds[tx-1] = Pds[tx] * Nds[tx-1];
__syncthreads();

Is there any way to further accelerate this code or avoid divergence?

╰ゝ天使的微笑 2024-09-09 03:32:00

You should be able to eliminate the branch like this:

int tx_index = tx ^ 1; // equivalent to: tx_index = (tx & 1) ? tx - 1 : tx + 1
Nds[tx_index] = Pds[tx] * Nds[tx_index];
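
To show the branch-free indexing in context, here is a minimal sketch of a complete kernel with a host wrapper. `Pds` and `Nds` follow the question's naming; everything else (the names `swapMultiplyKernel` and `swapMultiply`, the single-block launch, and the requirement that the vectors be non-empty, of equal even length and no longer than 1024 elements) is an assumption made for illustration. Error checking of the CUDA calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: launched with one block of exactly n threads, n even.
__global__ void swapMultiplyKernel(const float* P, float* N, int n)
{
    extern __shared__ float smem[];
    float* Pds = smem;       // first vector in shared memory
    float* Nds = smem + n;   // second vector in shared memory

    int tx = threadIdx.x;
    Pds[tx] = P[tx];
    Nds[tx] = N[tx];
    __syncthreads();

    // tx ^ 1 flips the lowest bit: even tx -> tx + 1, odd tx -> tx - 1.
    // Every thread executes the same instructions, so there is no branch.
    int tx_index = tx ^ 1;
    Nds[tx_index] = Pds[tx] * Nds[tx_index];
    __syncthreads();

    N[tx] = Nds[tx];         // write the result back to global memory
}

// Hypothetical host wrapper: returns the updated second vector.
// Assumes p.size() == n.size(), non-empty, even, and at most 1024.
std::vector<float> swapMultiply(const std::vector<float>& p, std::vector<float> n)
{
    const int count = static_cast<int>(n.size());
    const size_t bytes = count * sizeof(float);

    float* dP = nullptr;
    float* dN = nullptr;
    cudaMalloc(&dP, bytes);
    cudaMalloc(&dN, bytes);
    cudaMemcpy(dP, p.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dN, n.data(), bytes, cudaMemcpyHostToDevice);

    // Dynamic shared memory holds both vectors, hence 2 * bytes.
    swapMultiplyKernel<<<1, count, 2 * bytes>>>(dP, dN, count);

    cudaMemcpy(n.data(), dN, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dP);
    cudaFree(dN);
    return n;
}
```

Calling `swapMultiply({1, 2, 3, 4}, {10, 100, 1000, 10000})` should return `{20, 100, 4000, 30000}`: each element of the second vector is multiplied by its neighbour's element from the first vector. Because each thread reads and writes only `Nds[tx ^ 1]`, no two threads touch the same slot of `Nds` in the multiply step.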

千纸鹤 2024-09-09 03:32:00

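This is an old post, but someone may find my answer useful. If tx in your code is threadIdx, then you have a branch that causes warp divergence. You should avoid divergence among the threads of a warp, because the hardware serializes the divergent paths: the even-indexed threads execute their path while the odd-indexed threads sit idle, and then the odd-indexed threads execute theirs. If tx is threadIdx, try changing the algorithm so that the branch depends on blockIdx instead.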
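
As a rough illustration of a branch that depends on blockIdx, the sketch below works directly on global memory and lets even-numbered blocks handle the even-indexed elements of the first vector while odd-numbered blocks handle the odd-indexed ones. The names `pairMultiplyByBlock` and `pairMultiply`, the block size of 128, and the assumption that both vectors are non-empty with the same even length are hypothetical choices for this example, not something stated in the thread. Each thread accesses memory with a stride of 2, so this layout is not guaranteed to be faster; it is worth profiling before adopting it.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: blocks 2k and 2k+1 cover the same chunk of pairs.
__global__ void pairMultiplyByBlock(const float* P, float* N, int n)
{
    int pair = (blockIdx.x / 2) * blockDim.x + threadIdx.x;
    int even = 2 * pair;         // index of the even element of this pair
    if (even + 1 >= n) return;   // tail guard only (n assumed even)

    // The condition depends only on blockIdx, so all threads of a block,
    // and therefore of each warp, take the same path.
    if ((blockIdx.x & 1) == 0)
        N[even + 1] = P[even] * N[even + 1];  // even P element, next odd N element
    else
        N[even] = P[even + 1] * N[even];      // odd P element, preceding even N element
}

// Hypothetical host wrapper: returns the updated second vector.
// Assumes p.size() == n.size(), non-empty and even.
std::vector<float> pairMultiply(const std::vector<float>& p, std::vector<float> n)
{
    const int count = static_cast<int>(n.size());
    const size_t bytes = count * sizeof(float);
    const int threads = 128;
    const int pairs = count / 2;
    // One "even" block and one "odd" block per chunk of `threads` pairs.
    const int blocks = 2 * ((pairs + threads - 1) / threads);

    float* dP = nullptr;
    float* dN = nullptr;
    cudaMalloc(&dP, bytes);
    cudaMalloc(&dN, bytes);
    cudaMemcpy(dP, p.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dN, n.data(), bytes, cudaMemcpyHostToDevice);

    pairMultiplyByBlock<<<blocks, threads>>>(dP, dN, count);

    cudaMemcpy(n.data(), dN, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dP);
    cudaFree(dN);
    return n;
}
```

With `p = {1, 2, 3, 4}` and `n = {10, 100, 1000, 10000}`, `pairMultiply(p, n)` should return `{20, 100, 4000, 30000}`. Each element of the second vector is read and written by exactly one thread, so the in-place update needs no synchronization between blocks.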
