CUDA: more dimensions for a block, or just one?
I need to square root each element of a matrix (which is basically a vector of float values once in memory) using CUDA.
Matrix dimensions are not known a priori and may vary [2–20,000].
I was wondering: I might use (as Jonathan suggested here) one block dimension like this:
int thread_id = blockDim.x * block_id + threadIdx.x;
and check that thread_id is lower than rows*columns... that's simple and straightforward.
But is there any particular performance reason why I should use two (or even three) grid dimensions to perform such a calculation (keeping in mind that I have a matrix, after all) instead of just one?
I'm thinking of coalescing issues, such as making all threads read values sequentially.
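For reference, a minimal sketch of the 1D approach described above (kernel name and launch configuration are my own illustration, not from the original post):

```cuda
// Treat the rows*cols matrix as one flat array of n floats.
__global__ void sqrt_elements(float *data, int n)
{
    int thread_id = blockDim.x * blockIdx.x + threadIdx.x;
    if (thread_id < n)                       // guard against the last, partial block
        data[thread_id] = sqrtf(data[thread_id]);
}

// Launch with enough 1D blocks to cover all n = rows * cols elements:
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;
//   sqrt_elements<<<blocks, threads>>>(d_data, n);
```

With this layout, consecutive thread_ids touch consecutive addresses, so the loads and stores coalesce naturally.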
The dimensions only exist for convenience, internally everything is linear, so there would be no advantage in terms of efficiency either way. Avoiding the computation of the (contrived) linear index as you've shown above would seem to be a bit faster, but there wouldn't be any difference in how the threads coalesce.