Optimizing vector element swaps using CUDA

Posted on 2024-09-02 05:33:53

I am new to CUDA and would appreciate your help.
I have a long vector and, for each group of 24 elements, I need to do the following:
for the first 12 elements, the even-numbered elements are multiplied by -1;
for the second 12 elements, the odd-numbered elements are multiplied by -1. Then the following swap takes place:

Graph (I don't have enough reputation points yet to post the image inline, so here is a link):

http://www.freeimagehosting.net/image.php?e4b88fb666.png

I have written this piece of code and wonder if you could help me optimize it further to reduce divergence or bank conflicts.

// subVector is a compile-time constant and a multiple of 24;
// Mds and Nds are in shared memory
__shared__ double Mds[subVector];
__shared__ double Nds[subVector];

int tx = threadIdx.x;
int tx_mod = tx ^ 0x0001;   // index of the other element in the even/odd pair
int basex = __umul24(blockDim.x, blockIdx.x);

Mds[tx] = M.elements[basex + tx];
__syncthreads();

// flip the signs
if (tx < (tx/24)*24 + 12)
{
    // first 12 elements of the group: negate the even ones
    if ((tx & 0x0001) == 0)
        Mds[tx] = -Mds[tx];
}
else
{
    // second 12 elements of the group (12..23): negate the odd ones
    if ((tx & 0x0001) == 1)
        Mds[tx] = -Mds[tx];
}

__syncthreads();

if (tx < (tx/24)*24 + 6)
{
    // first 6 elements: swap with the last six in the 24-element group (see graph)
    Nds[tx] = Mds[tx_mod + 18];
    Mds[tx_mod + 18] = Mds[tx];
    Mds[tx] = Nds[tx];
}
else if (tx < (tx/24)*24 + 12)
{
    // second 6 elements: swap with the next adjacent six (see graph)
    Nds[tx] = Mds[tx_mod + 6];
    Mds[tx_mod + 6] = Mds[tx];
    Mds[tx] = Nds[tx];
}
__syncthreads();

Thanks in advance.


Comments (1)

我一向站在原地 2024-09-09 05:33:53

Paul gave you pretty good starting points in your previous question.

A couple of things to watch out for: you are dividing by a number that is not a power of two, which is expensive.
Instead, try to utilize the multidimensional nature of the thread block. For example, make the x-dimension of size 24, which will eliminate the need for division.

In general, try to fit the thread block dimensions to your data dimensions.
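
As a rough illustration of the 24-wide block idea, here is a minimal sketch of the sign-flip step only. The names `flipSigns2D`, `launchFlipSigns2D`, `GROUP` and `groupsPerBlock` are made up for this example and are not from the question; it also assumes the data sits in a plain `double` array in global memory rather than the `M.elements` struct.

```cuda
#include <cuda_runtime.h>

// Hypothetical constant: elements per group, as described in the question.
constexpr int GROUP = 24;

// threadIdx.x is the position inside the 24-element group (0..23),
// threadIdx.y selects which group of the block this thread works on,
// so no division or modulo by 24 is needed.
__global__ void flipSigns2D(double *data, int numGroups)
{
    int pos   = threadIdx.x;
    int group = blockIdx.x * blockDim.y + threadIdx.y;  // global group index
    if (group >= numGroups) return;

    int  idx       = group * GROUP + pos;
    bool firstHalf = pos < 12;
    bool even      = (pos & 1) == 0;

    // First 12 elements: negate even positions.
    // Second 12 elements: negate odd positions.
    if (firstHalf == even)
        data[idx] = -data[idx];
}

// Hypothetical host helper: launches blocks of 24 x groupsPerBlock threads.
void launchFlipSigns2D(double *d_data, int numGroups, int groupsPerBlock = 8)
{
    dim3 block(GROUP, groupsPerBlock);
    dim3 grid((numGroups + groupsPerBlock - 1) / groupsPerBlock);
    flipSigns2D<<<grid, block>>>(d_data, numGroups);
}
```

With `groupsPerBlock = 8` the block has 192 threads, a multiple of the warp size. The swap step could reuse the same `pos` value in place of the `(tx/24)*24` comparisons.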

Simplify the sign flipping: if you do not want to flip the sign, you can still multiply by the identity, 1. Figure out how to map even/odd numbers to 1 and -1 using just arithmetic, for example sign = (even*2+1) - 2, where even is either 1 or 0 (this gives +1 when even is 1 and -1 when even is 0).
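
To make the arithmetic mapping concrete, here is one possible branch-free version of the sign rule from the question (first 12 elements: even ones negated; second 12: odd ones negated). `groupSign` and `flipSignsBranchFree` are hypothetical names, and the formula is a variant of the one above that also accounts for which half of the group the element is in. The sketch uses `idx % 24` only to stay self-contained; with a 24-wide block the position could come from `threadIdx.x` instead.

```cuda
#include <cuda_runtime.h>

// Returns -1.0 where the question asks for a sign flip and +1.0 elsewhere,
// for a position 0..23 inside a 24-element group.
__host__ __device__ inline double groupSign(int pos)
{
    int even   = 1 - (pos & 1);   // 1 for even positions, 0 for odd ones
    int second = pos >= 12;       // 0 in the first 12 elements, 1 in the second 12
    int flip   = even ^ second;   // 1 exactly where the sign must be flipped
    return 1.0 - 2.0 * flip;      // maps 0 -> +1.0 and 1 -> -1.0
}

// Every thread performs the same multiply, so there is no divergent branch
// on the even/odd test.
__global__ void flipSignsBranchFree(double *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= groupSign(idx % 24);
}
```

Whether this is actually faster than the predicated `if` in the question depends on the hardware and compiler, so it is worth profiling both.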
