How can I transpose a matrix optimally using BLAS?
I'm doing some calculations and analyzing the strengths and weaknesses of different BLAS implementations. However, I have come across a problem.

I'm testing cuBLAS; doing linear algebra on the GPU seems like a good idea, but there is one problem: the cuBLAS implementation uses column-major format, and since this is not what I need in the end, I'm curious whether there is a way to make BLAS perform a matrix transpose?
1 Answer
BLAS doesn't have a matrix-transpose routine built in. The CUDA SDK includes a matrix-transpose example, along with a paper that discusses optimal strategies for performing a transpose. Your best strategy is probably to use row-major inputs to CUBLAS with the transposed-input version of the calls, perform the intermediate calculations in column-major order, and lastly perform a transpose operation afterwards using the SDK transpose kernel.
Edited to add: CUBLAS added a transpose routine, geam, in CUBLAS version 5. It can perform matrix transposition in GPU memory and should be regarded as optimal for whatever architecture you are using.
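A minimal sketch of an out-of-place transpose with cublasSgeam, which computes C = alpha*op(A) + beta*op(B); with alpha = 1, beta = 0, and op(A) = A^T this gives C = A^T entirely in GPU memory. The matrix values are made up for illustration, and error checking is omitted for brevity (this needs a CUDA-capable GPU to run):

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int m = 2, n = 3;            /* A is m-by-n, column-major */
    float hA[6] = {1, 4, 2, 5, 3, 6};  /* columns of [[1,2,3],[4,5,6]] */
    float hC[6];                        /* will receive the n-by-m transpose */

    float *dA, *dC;
    cudaMalloc((void **)&dA, m * n * sizeof(float));
    cudaMalloc((void **)&dC, m * n * sizeof(float));
    cudaMemcpy(dA, hA, m * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    /* C is n-by-m with leading dimension n; op(A) = A^T. Since
     * beta == 0, B is not read, so dA is passed as a placeholder. */
    cublasSgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                n, m,
                &alpha, dA, m,
                &beta,  dA, n,
                dC, n);

    cudaMemcpy(hC, dC, m * n * sizeof(float), cudaMemcpyDeviceToHost);
    /* hC now holds A^T in column-major order. */

    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dC);
    return 0;
}
```

Note that geam requires the output C not to overlap A when a transpose op is used, so this is an out-of-place transpose; allocate a separate destination buffer as above.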