CUBLAS - matrix addition .. how?
I am trying to use CUBLAS to sum two big matrices of unknown size. I need fully optimized code (if possible), so I chose not to rewrite the matrix addition code (simple) but to use CUBLAS, in particular the cublasSgemm function, which allows summing A and C (if B is an identity matrix): C = alpha*op(A)*op(B) + beta*C.
The problem is: C and C++ store matrices in row-major format, while cublasSgemm is intended (for Fortran compatibility) to work in column-major format. You can specify whether A and B are to be transposed first, but you can NOT indicate that C should be transposed. So I'm unable to complete my matrix addition.
I can't transpose the C matrix myself, because the matrix can be as large as 20000x20000.
Any idea on how to solve this?
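For concreteness, a minimal sketch of the call I have in mind, assuming square n x n matrices and the CUBLAS v2 API; d_A, d_I and d_C are placeholder device pointers, with d_I pre-filled with an identity matrix:

    #include <cublas_v2.h>

    /* Sketch only: computes C = alpha*A*I + beta*C, i.e. C = alpha*A + beta*C.
     * d_A, d_I, d_C are device pointers; d_I must already hold the identity. */
    void add_via_gemm(cublasHandle_t handle, int n, float alpha, float beta,
                      const float *d_A, const float *d_I, float *d_C)
    {
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n,          /* all dimensions n for square matrices */
                    &alpha, d_A, n,   /* A, leading dimension n */
                    d_I, n,           /* B = identity matrix */
                    &beta, d_C, n);   /* C accumulates beta*C */
    }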
2 Answers
If you're just adding the matrices, it doesn't actually matter. You give it alpha, Aij, beta, and Cij. It thinks you're giving it alpha, Aji, beta, and Cji, and gives you what it thinks is Cji = beta Cji + alpha Aji. But that's the correct Cij as far as you're concerned. My worry is when you start going to things which do matter -- like matrix products. There, there's likely no working around it.
But more to the point, you don't want to be using GEMM to do matrix addition -- you're doing a completely pointless matrix multiplication (which takes ~20,000^3 operations and many passes through memory) for an operation which should only require ~20,000^2 operations and a single pass! Treat the matrices as 20,000^2-long vectors and use saxpy.
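A minimal sketch of that suggestion (assuming the CUBLAS v2 API; d_A and d_C are placeholder device pointers holding your two m x n matrices):

    #include <cublas_v2.h>

    /* Element-wise C = C + alpha*A in one pass: treat both m x n matrices as
     * flat vectors of length m*n. Storage order is irrelevant here because the
     * operation is element-wise and both matrices use the same layout. */
    void matrix_add(cublasHandle_t handle, int m, int n, float alpha,
                    const float *d_A, float *d_C)
    {
        cublasSaxpy(handle, m * n, &alpha, d_A, 1, d_C, 1);
    }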
Matrix multiplication is memory-bandwidth intensive, so there is a huge (factors of 10x or 100x) difference in performance between coding it yourself and a tuned version. Ideally, you'd change structures in your code to match the library. If you can't, in this case you can manage just by using linear algebra identities. The C-vs-Fortran ordering means that when you pass in A, CUBLAS "sees" A^T (A transpose). Which is fine, we can work around it. If what you want is C = A.B, pass in the matrices in the opposite order, B.A. Then the library sees (B^T . A^T), and calculates C^T = (A.B)^T; and then when it passes back C^T, you get (in your ordering) C. Test it and see.
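A minimal sketch of the swap trick (same assumptions: CUBLAS v2 API; d_A, d_B, d_C are placeholder device pointers to row-major data):

    #include <cublas_v2.h>

    /* C = A*B for row-major data: pass the operands in reverse order. CUBLAS,
     * thinking column-major, sees B^T (n x k) times A^T (k x m) and writes
     * C^T (n x m) -- which, read back row-major, is exactly C (m x n). */
    void rowmajor_sgemm(cublasHandle_t handle, int m, int n, int k,
                        const float *d_A,  /* m x k, row-major */
                        const float *d_B,  /* k x n, row-major */
                        float *d_C)        /* m x n, row-major */
    {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, m, k,       /* dimensions of the transposed product */
                    &alpha,
                    d_B, n,        /* first operand: B^T, leading dim n */
                    d_A, k,        /* second operand: A^T, leading dim k */
                    &beta,
                    d_C, n);       /* output: C^T, leading dim n */
    }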
cublas<t>geam (e.g. cublasSgeam for single precision) has been added in CUBLAS 5.0.
It computes the weighted sum of two optionally transposed matrices: C = alpha*op(A) + beta*op(B).
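A minimal sketch of a plain matrix addition with it (assuming CUBLAS >= 5.0 with the v2 API; d_A, d_B and d_C are placeholder device pointers to m x n column-major matrices):

    #include <cublas_v2.h>

    /* C = alpha*A + beta*B in a single pass, no multiplication involved. */
    void geam_add(cublasHandle_t handle, int m, int n,
                  const float *d_A, const float *d_B, float *d_C)
    {
        const float alpha = 1.0f, beta = 1.0f;
        cublasSgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n,
                    &alpha, d_A, m,   /* leading dimension m (column-major) */
                    &beta,  d_B, m,
                    d_C, m);
    }

Note that for a plain element-wise sum with CUBLAS_OP_N, the row- vs column-major question disappears again, since both inputs and the output use the same layout.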