How can I determine whether this write access is coalesced?
How can I determine if the following memory access is coalesced or not:
// Thread ID
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Offset: total number of threads in the grid
int offset = gridDim.x * blockDim.x;
while ( idx < NUMELEMENTS )
{
    // Do something
    // ....
    // Write to the array that holds the results of the calculations
    results[ idx ] = df2;
    // Next element
    idx += offset;
}
NUMELEMENTS is the total number of data elements to process. The array results is allocated in global memory beforehand and passed to the kernel function as a pointer.
My question: is the write access in the line results[ idx ] = df2; coalesced?
I believe it is, since consecutive threads process consecutively indexed items, but I'm not completely sure and I don't know how to tell.
Thanks!
Comments (1)
It depends on whether the length of the rows of your matrix is a multiple of half the warp size (for devices of compute capability 1.x) or a multiple of the warp size (for devices of compute capability 2.x). If it is not, you can use padding to make the accesses fully coalesced. The function cudaMallocPitch can be used for this purpose.
Edit:
Sorry for the confusion. You write 'offset' elements at a time, which I interpreted as the rows of a matrix.
What I mean is that after each iteration of your loop you increase idx by offset. If offset is a multiple of half the warp size (compute capability 1.x) or a multiple of the warp size (compute capability 2.x), then the write is coalesced; if not, you need padding to make it so.
It is probably already coalesced, because you should choose the number of threads per block, and thus blockDim, as a multiple of the warp size. In that case offset = gridDim.x * blockDim.x is a multiple of the warp size as well.
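One way to check this reasoning without a GPU is to model the index arithmetic of the loop on the host. The sketch below is only an illustration: the helper names writtenIndex and warpWritesAreAligned are hypothetical, the warp size is assumed to be 32, and "aligned" here means only that each warp's first element index is a multiple of the warp size. Whether that also gives aligned byte addresses depends on the element size and on the base address of results, which this model does not look at.

```cpp
#include <cstdint>

// Assumption: warp size of 32 threads.
constexpr int kWarpSize = 32;

// Hypothetical helper: element index written by one thread in loop
// iteration `iter`, mirroring
//   idx = blockIdx.x * blockDim.x + threadIdx.x;  idx += offset;
std::int64_t writtenIndex(int blockIndex, int blockSize, int threadIndex,
                          int gridSize, int iter) {
    const std::int64_t idx =
        static_cast<std::int64_t>(blockIndex) * blockSize + threadIndex;
    const std::int64_t offset =
        static_cast<std::int64_t>(gridSize) * blockSize;
    return idx + iter * offset;
}

// Hypothetical helper: true if, for every block and every iteration, the
// first thread of each warp writes an element index that is a multiple of
// the warp size. The other lanes of a warp follow at consecutive indices
// by construction, so only the starting index needs checking.
bool warpWritesAreAligned(int gridSize, int blockSize, int iters) {
    for (int iter = 0; iter < iters; ++iter) {
        for (int block = 0; block < gridSize; ++block) {
            for (int first = 0; first < blockSize; first += kWarpSize) {
                const std::int64_t start =
                    writtenIndex(block, blockSize, first, gridSize, iter);
                if (start % kWarpSize != 0) {
                    return false;
                }
            }
        }
    }
    return true;
}
```

With 128 threads per block every warp starts on a multiple of 32 in every iteration, while with 100 threads per block the first warp of block 1 already starts at element 100, which is not a multiple of 32.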