为什么《CUDA 编程指南》中 char3 的对齐大小为 1?

发布于 2024-12-09 12:58:35 字数 373 浏览 1 评论 0原文

我发现 char1 的对齐方式为 1,char2 的对齐方式为 2,char3 的对齐方式为 1,char4 的对齐方式为4. 为什么 char3 的对齐大小是 1 而不是 3?

我还需要一些帮助来理解“对齐要求”的概念。

请参阅此CUDA编程指南(表附录 B 中 B.3.1 节的 B-1)

非常感谢,

Yik

I found that alignment for char1 is 1, for char2 is 2, for char3 is 1 and for char4 is 4.
Why is the alignment size for char3 is 1 instead of 3?

I also need some help to understand the concept of "Alignment Requirements".

Please refer to this CUDA Programming Guide (Table B-1 of Section B.3.1 in Appendix B)

Many thanks,

Yik

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

北方的巷 2024-12-16 12:58:35

考虑一下如果您有一个 char3 数组(> 的对齐方式)会发生什么? 1 表示非连续数组。

然而,对于 char4,您没有任何此类填充问题,并且您可能希望确保可以将 char4 视为 32 位 int,因此 4字节对齐很有帮助。

Think about what would happen if you had an array of char3 - an alignment of > 1 would mean a non-contiguous array.

For char4 however you do not have any such padding problems, and you probably want to ensure that a char4 can be treated as if it were a 32 bit int, so a 4 byte alignment is helpful.

狂之美人 2024-12-16 12:58:35

对于像我这样仍然对这个问题感到困惑的人:
全局内存访问的大小始终为 1、2、4、8、16 字节等。所以3字节的数据必须通过填充来对齐才能达到高效率。如果是这样,那么 char3 实际上是一个 char4 在幕后,我们正在使用额外的字节来提高效率。那么对齐方式就是4。

而且还有一种方式,就是CUDA实际使用的方式。他们没有牺牲空间来换取时间,而是反其道而行之。 char3 本质上并不是 char4,而是 3 个 char1

使用 CUDA 11.7 和 CC 8.6,我编写了以下代码:

__device__ char2 ch2[100];
__device__ char3 ch3[100];
__device__ char4 ch4[100];
__global__ void test(char3 chval3,char2 chval2,char4 chval4){
    auto x=threadIdx.x;
    ch3[x]=chval3;
    ch2[x]=chval2;
    ch4[x]=chval4;
}

NVCC 生成的 ptx 代码如下:

ld.param.u8     %rs1, [_Z4test5char35char25char4_param_0+2];
ld.param.u8     %rs2, [_Z4test5char35char25char4_param_0+1];
ld.param.u8     %rs3, [_Z4test5char35char25char4_param_0];
mov.u32     %r1, %tid.x;
mul.wide.u32    %rd1, %r1, 3;
mov.u64     %rd2, ch3;
add.s64     %rd3, %rd2, %rd1;
st.global.u8    [%rd3], %rs3;
st.global.u8    [%rd3+1], %rs2;
st.global.u8    [%rd3+2], %rs1;
mul.wide.u32    %rd4, %r1, 2;
mov.u64     %rd5, ch2;
add.s64     %rd6, %rd5, %rd4;
ld.param.u8     %rs4, [_Z4test5char35char25char4_param_1+1];
ld.param.u8     %rs5, [_Z4test5char35char25char4_param_1];
st.global.v2.u8     [%rd6], {%rs5, %rs4};
mul.wide.u32    %rd7, %r1, 4;
mov.u64     %rd8, ch4;
add.s64     %rd9, %rd8, %rd7;
ld.param.u8     %rs6, [_Z4test5char35char25char4_param_2+3];
ld.param.u8     %rs7, [_Z4test5char35char25char4_param_2+2];
ld.param.u8     %rs8, [_Z4test5char35char25char4_param_2+1];
ld.param.u8     %rs9, [_Z4test5char35char25char4_param_2];
st.global.v4.u8     [%rd9], {%rs9, %rs8, %rs7, %rs6};
ret;

我们可以看到对 char2char4 的操作被编译为单个指令为 st.global.v2st.global.v4,而 char3 实际上被视为 3 char1

那么回到问题,为什么char3的对齐方式是1?因为实际上是3个char1,不需要对齐。它不是为了填充,相反,如果我们在这里填充的话,对齐方式将是4。

For anyone like me still getting confused about this problem:
The global memory access is always in size of 1, 2, 4, 8, 16 bytes and so on. So a 3-byte data must be padded to be aligned to achieve high efficiency. If so, then a char3 is actually a char4 under the hood, and we are using an extra byte to improve efficiency. Then the alignment is 4.

And there is also another way, which is the way CUDA actually used. Rather than sacrifice space for time, they do the opposite. A char3 is not a char4 under the hood, instead, it is 3 char1!

With CUDA 11.7 and CC 8.6, I wrote the code below:

__device__ char2 ch2[100];
__device__ char3 ch3[100];
__device__ char4 ch4[100];
__global__ void test(char3 chval3,char2 chval2,char4 chval4){
    auto x=threadIdx.x;
    ch3[x]=chval3;
    ch2[x]=chval2;
    ch4[x]=chval4;
}

NVCC produced the ptx code as below:

ld.param.u8     %rs1, [_Z4test5char35char25char4_param_0+2];
ld.param.u8     %rs2, [_Z4test5char35char25char4_param_0+1];
ld.param.u8     %rs3, [_Z4test5char35char25char4_param_0];
mov.u32     %r1, %tid.x;
mul.wide.u32    %rd1, %r1, 3;
mov.u64     %rd2, ch3;
add.s64     %rd3, %rd2, %rd1;
st.global.u8    [%rd3], %rs3;
st.global.u8    [%rd3+1], %rs2;
st.global.u8    [%rd3+2], %rs1;
mul.wide.u32    %rd4, %r1, 2;
mov.u64     %rd5, ch2;
add.s64     %rd6, %rd5, %rd4;
ld.param.u8     %rs4, [_Z4test5char35char25char4_param_1+1];
ld.param.u8     %rs5, [_Z4test5char35char25char4_param_1];
st.global.v2.u8     [%rd6], {%rs5, %rs4};
mul.wide.u32    %rd7, %r1, 4;
mov.u64     %rd8, ch4;
add.s64     %rd9, %rd8, %rd7;
ld.param.u8     %rs6, [_Z4test5char35char25char4_param_2+3];
ld.param.u8     %rs7, [_Z4test5char35char25char4_param_2+2];
ld.param.u8     %rs8, [_Z4test5char35char25char4_param_2+1];
ld.param.u8     %rs9, [_Z4test5char35char25char4_param_2];
st.global.v4.u8     [%rd9], {%rs9, %rs8, %rs7, %rs6};
ret;

We can see that the operations to char2 and char4 are compiled to single instructions as st.global.v2, st.global.v4, while the char3 is actually treated as 3 char1.

So back to the question, why the alignment for char3 is 1? Because it is 3 char1 in fact, and no alignment is needed. It is not for padding, on the contrary, the alignment will be 4 if we do the padding here.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文