CUBLAS - is matrix element exponentiation possible?
I'm using CUBLAS (the CUDA BLAS library) for matrix operations.
Is it possible to use CUBLAS to compute the element-wise exponentiation / square root of a matrix's entries?
I mean, having the 2x2 matrix
1 4
9 16
What I want is a function to raise each entry to a given power, e.g. 2:
1 16
81 256
and one to take the element-wise square root, e.g.
1 2
3 4
Is this possible with CUBLAS? I can't find a function suited to this goal, so I'm asking here first before I start coding my own kernel.
So this may well be something you do have to implement yourself, because the library won't do it for you. (There's probably some way to implement part of it in terms of BLAS level 3 routines - certainly the squaring of the matrix elements - but it would involve expensive and otherwise unnecessary matrix-vector multiplications, and I still don't know how you'd do the square-root operation.) The reason is that these operations aren't really linear-algebra procedures; taking the square root of each matrix element doesn't correspond to any fundamental linear algebra operation.
The good news is that these elementwise operations are very simple to implement in CUDA. Again, there are lots of tuning options one could play with for best performance, but one can get started fairly easily.
As with the matrix addition operations, you'll be treating the NxM matrices here as (N*M)-length vectors; the structure of the matrix doesn't matter for these elementwise operations. So you'll be passing in a pointer to the first element of the matrix and treating it as a single list of N*M numbers. (I'm going to assume you're using floats here, as you were talking about SGEMM and SAXPY earlier.)

The kernel, the actual bit of CUDA code which implements the operation, is quite simple. For now, each thread will compute the square (or square root) of one array element. (Whether or not this is optimal for performance is something you could test.) So the kernels would look like the following. I'm assuming you're doing something like B_ij = (A_ij)^2; if you wanted to do the operation in place, e.g. A_ij = (A_ij)^2, you could do that, too:
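For instance, a pair of kernels along these lines would do it. This is a minimal sketch under the assumptions above (float data, one thread per element, out-of-place result); the kernel names and signatures are just illustrative:

__global__ void squareElements(const float *a, float *b, int n)
{
    /* global index of the element this thread handles */
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    /* guard: the last block may extend past the end of the array */
    if (tid < n)
        b[tid] = a[tid] * a[tid];
}

__global__ void sqrtElements(const float *a, float *b, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (tid < n)
        b[tid] = sqrt(a[tid]);   /* or sqrtf(a[tid]); see the note below */
}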
Note that if you're ok with very slightly increased error, the 'sqrtf()' function, which has a maximum error of 3 ulp (units in the last place), is significantly faster.
How you call these kernels will depend on the order in which you're doing things. If you've already made some CUBLAS calls on these matrices, you'll want to run the kernels on the arrays that are already in GPU memory.
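For instance, assuming the input matrix is already on the device as d_A from the earlier CUBLAS work (d_A, h_B, N, M and the block size of 256 are assumptions for the sake of the example), a launch might look like:

int n = N * M;                                  /* treat the NxM matrix as one flat array */
float *d_B = NULL;
cudaMalloc((void **)&d_B, n * sizeof(float));   /* space for the result on the device */

int threadsPerBlock = 256;
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   /* round up so every element is covered */
squareElements<<<blocks, threadsPerBlock>>>(d_A, d_B, n);

/* copy back to the host only if you actually need the result there */
cudaMemcpy(h_B, d_B, n * sizeof(float), cudaMemcpyDeviceToHost);

If the data is still on the host, you'd copy it to the device first with cudaMemcpy (or, since you're already using CUBLAS, with cublasSetMatrix) and then launch the kernel the same way.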