为什么《CUDA 编程指南》中 char3 的对齐大小为 1?
我发现 char1
的对齐方式为 1,char2
的对齐方式为 2,char3
的对齐方式为 1,char4
的对齐方式为4. 为什么 char3 的对齐大小是 1 而不是 3?
我还需要一些帮助来理解“对齐要求”的概念。
请参阅此CUDA编程指南(表附录 B 中 B.3.1 节的 B-1)
非常感谢,
Yik
I found that alignment for char1
is 1, for char2
is 2, for char3
is 1 and for char4
is 4.
Why is the alignment size for char3 is 1 instead of 3?
I also need some help to understand the concept of "Alignment Requirements".
Please refer to this CUDA Programming Guide (Table B-1 of Section B.3.1 in Appendix B)
Many thanks,
Yik
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
考虑一下如果您有一个
char3
数组(> 的对齐方式)会发生什么? 1 表示非连续数组。然而,对于 char4,您没有任何此类填充问题,并且您可能希望确保可以将 char4 视为 32 位 int,因此 4字节对齐很有帮助。
Think about what would happen if you had an array of
char3
- an alignment of > 1 would mean a non-contiguous array.For
char4
however you do not have any such padding problems, and you probably want to ensure that achar4
can be treated as if it were a 32 bit int, so a 4 byte alignment is helpful.对于像我这样仍然对这个问题感到困惑的人:
全局内存访问的大小始终为 1、2、4、8、16 字节等。所以3字节的数据必须通过填充来对齐才能达到高效率。如果是这样,那么
char3
实际上是一个char4
在幕后,我们正在使用额外的字节来提高效率。那么对齐方式就是4。而且还有一种方式,就是CUDA实际使用的方式。他们没有牺牲空间来换取时间,而是反其道而行之。
char3
本质上并不是char4
,而是 3 个char1
!使用 CUDA 11.7 和 CC 8.6,我编写了以下代码:
NVCC 生成的 ptx 代码如下:
我们可以看到对
char2
和char4
的操作被编译为单个指令为st.global.v2
、st.global.v4
,而char3
实际上被视为 3char1
。那么回到问题,为什么
char3
的对齐方式是1?因为实际上是3个char1
,不需要对齐。它不是为了填充,相反,如果我们在这里填充的话,对齐方式将是4。For anyone like me still getting confused about this problem:
The global memory access is always in size of 1, 2, 4, 8, 16 bytes and so on. So a 3-byte data must be padded to be aligned to achieve high efficiency. If so, then a
char3
is actually achar4
under the hood, and we are using an extra byte to improve efficiency. Then the alignment is 4.And there is also another way, which is the way CUDA actually used. Rather than sacrifice space for time, they do the opposite. A
char3
is not achar4
under the hood, instead, it is 3char1
!With CUDA 11.7 and CC 8.6, I wrote the code below:
NVCC produced the ptx code as below:
We can see that the operations to
char2
andchar4
are compiled to single instructions asst.global.v2
,st.global.v4
, while thechar3
is actually treated as 3char1
.So back to the question, why the alignment for
char3
is 1? Because it is 3char1
in fact, and no alignment is needed. It is not for padding, on the contrary, the alignment will be 4 if we do the padding here.