Matrix operations in CUDA
What is the best way to organize matrix operations in CUDA (in terms of performance)?
For example, I want to calculate C * C^(-1) * B^T + C, where C and B are matrices.
Should I write separate functions for multiplication, transposition, and so on, or one function for the whole expression?
Which way is fastest?
2 Answers
I'd recommend using the CUBLAS library. It's normally much faster and more reliable than anything you could write on your own. In addition, its API is similar to that of the BLAS library, which is the standard library for numerical linear algebra.
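For illustration, here is a minimal sketch (not the asker's full expression) of how part of it maps onto CUBLAS: a single cublasSgemm call covers the multiplication, the transposition, and the final addition at once. The matrix size n and the buffer names are assumptions made for the example; the inverse would additionally need something like cublasSgetrfBatched/cublasSgetriBatched or cuSOLVER and is omitted here.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main(void)
{
    const int n = 4;                       /* assumed matrix size */
    const size_t bytes = n * n * sizeof(float);
    float *dA, *dB, *dD;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dD, bytes);
    /* ... fill dA and dB with your data, and dD with C (e.g. cudaMemcpy) ... */

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* D = 1.0 * A * B^T + 1.0 * D, where D already holds C:
       one GEMM does the multiply, the transpose, and the add. */
    const float alpha = 1.0f, beta = 1.0f;
    cublasSgemm(handle,
                CUBLAS_OP_N, CUBLAS_OP_T,  /* no-op on A, transpose B */
                n, n, n,
                &alpha, dA, n,
                dB, n,
                &beta, dD, n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dD);
    return 0;
}
```

Folding the addition into GEMM's beta term avoids a separate elementwise kernel and an extra pass over memory.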
I think the answer depends heavily on the size of your matrices.
If you can fit a matrix in shared memory, I would probably use a single block to compute it and keep everything inside a single kernel (probably a bigger one, where this computation is only a part of it). Hopefully, if you have more matrices and you need to evaluate the above equation several times, you can do it in parallel, utilising all of the GPU's computing power.
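As a rough sketch of that single-kernel idea (the size N, the names, and the fused operation D = A*B + C are illustrative assumptions): one block stages the operands in shared memory, synchronises once, and performs the multiply and the final add without ever leaving the kernel.

```cuda
#define N 16   /* assumed matrix size; N*N must not exceed the block limit */

__global__ void fused_small(const float *A, const float *B,
                            const float *C, float *D)
{
    __shared__ float sA[N][N];
    __shared__ float sB[N][N];

    int row = threadIdx.y, col = threadIdx.x;
    sA[row][col] = A[row * N + col];       /* stage operands in shared memory */
    sB[row][col] = B[row * N + col];
    __syncthreads();                       /* block-wide barrier: loads done */

    float acc = 0.0f;
    for (int k = 0; k < N; ++k)
        acc += sA[row][k] * sB[k][col];

    D[row * N + col] = acc + C[row * N + col];   /* fuse the final addition */
}

/* launch with a single block of N x N threads:
   fused_small<<<1, dim3(N, N)>>>(dA, dB, dC, dD);  */
```

With many independent matrices you would launch one such block per matrix (indexing the batch via blockIdx.x), which is what keeps the rest of the GPU busy.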
However, if your matrices are much bigger, you will need more blocks to compute them (check the matrix multiplication example in the CUDA manual). You need a guarantee that every block has finished the multiplication before you proceed with the next part of your equation, and since blocks cannot synchronise with one another inside a kernel, that means a separate kernel call for each of your operations.
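A sketch of that multi-kernel route, with the same caveat that the kernel bodies and names are illustrative: the guarantee comes for free, because kernels launched on the same stream execute in order, so the second launch cannot start until every block of the first has finished.

```cuda
__global__ void transpose(const float *in, float *out, int n)
{
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < n && c < n)
        out[c * n + r] = in[r * n + c];    /* out = in^T (row-major) */
}

__global__ void add(const float *a, const float *b, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n * n)
        out[i] = a[i] + b[i];              /* elementwise sum */
}

/* Two of the operations from the question, one kernel each.
   The add launch implicitly waits for transpose to finish,
   since both run on the default stream. */
void evaluate(const float *dC, const float *dB, float *dTmp,
              float *dOut, int n)
{
    dim3 block(16, 16);
    dim3 grid((n + 15) / 16, (n + 15) / 16);
    transpose<<<grid, block>>>(dB, dTmp, n);              /* dTmp = B^T */
    add<<<(n * n + 255) / 256, 256>>>(dC, dTmp, dOut, n); /* dOut = C + B^T */
}
```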