Scaling the rows of a matrix with CUDA
In some computations on the GPU, I need to scale the rows in a matrix so that all the elements in a given row sum to 1.
| a1,1 a1,2 ... a1,N |      | alpha1*a1,1 alpha1*a1,2 ... alpha1*a1,N |
| a2,1 a2,2 ... a2,N |  =>  | alpha2*a2,1 alpha2*a2,2 ... alpha2*a2,N |
| .             .    |      | .                                  .    |
| aN,1 aN,2 ... aN,N |      | alphaN*aN,1 alphaN*aN,2 ... alphaN*aN,N |
where
alphai = 1.0/(ai,1 + ai,2 + ... + ai,N)
I need the vector of alphas and the scaled matrix, and I would like to do this in as few BLAS calls as possible. The code is going to run on NVIDIA CUDA hardware. Does anyone know of a smart way to do this?
3 Answers
CUBLAS 5.0 introduced a BLAS-like routine called cublas<t>dgmm, which multiplies a matrix by a diagonal matrix (represented by a vector).
There is a left option (which scales the rows) and a right option (which scales the columns).
Please refer to the CUBLAS 5.0 documentation for details.
So for your problem, you need to create a vector containing all the alphas on the GPU and call cublas<t>dgmm with the left option.
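For illustration, here is a minimal sketch of how the whole pipeline might look (my own example, not code from the answer): a single gemv against a vector of ones produces the row sums, a tiny hand-written kernel takes their reciprocals (cuBLAS has no elementwise reciprocal), and cublasSdgmm with CUBLAS_SIDE_LEFT applies them to the rows. The 4x4 test matrix and all names are my own choices.

    // Hedged sketch, not from the answer: normalize the rows of a small
    // column-major m x n matrix so that each row sums to 1.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    // alpha_i = 1 / rowsum_i (cuBLAS has no elementwise reciprocal)
    __global__ void reciprocal(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = 1.0f / x[i];
    }

    int main()
    {
        const int m = 4, n = 4;
        float h_A[m * n];
        for (int j = 0; j < n; ++j)          // column-major: A(i,j) = h_A[i + j*m]
            for (int i = 0; i < m; ++i)
                h_A[i + j * m] = (float)(i + j + 1);

        float *d_A, *d_ones, *d_alpha;
        cudaMalloc(&d_A,     m * n * sizeof(float));
        cudaMalloc(&d_ones,  n * sizeof(float));
        cudaMalloc(&d_alpha, m * sizeof(float));
        cudaMemcpy(d_A, h_A, m * n * sizeof(float), cudaMemcpyHostToDevice);

        float h_ones[n];
        for (int j = 0; j < n; ++j) h_ones[j] = 1.0f;
        cudaMemcpy(d_ones, h_ones, n * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);

        // d_alpha = A * ones = vector of row sums (one gemv call)
        const float one = 1.0f, zero = 0.0f;
        cublasSgemv(handle, CUBLAS_OP_N, m, n, &one, d_A, m,
                    d_ones, 1, &zero, d_alpha, 1);

        reciprocal<<<(m + 255) / 256, 256>>>(d_alpha, m);

        // A = diag(alpha) * A; CUBLAS_SIDE_LEFT scales the rows, and the
        // docs permit in-place use (C = A with ldc = lda)
        cublasSdgmm(handle, CUBLAS_SIDE_LEFT, m, n, d_A, m,
                    d_alpha, 1, d_A, m);

        cudaMemcpy(h_A, d_A, m * n * sizeof(float), cudaMemcpyDeviceToHost);
        for (int i = 0; i < m; ++i) {        // every row should now sum to 1
            for (int j = 0; j < n; ++j) printf("%8.4f ", h_A[i + j * m]);
            printf("\n");
        }

        cublasDestroy(handle);
        cudaFree(d_A); cudaFree(d_ones); cudaFree(d_alpha);
        return 0;
    }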
If you use BLAS gemv with a unit vector, the result will be a vector of the reciprocals of the scaling factors (1/alpha) you need. That is the easy part. Applying the factors row-wise is a bit harder, because standard BLAS doesn't have anything like a Hadamard product operator you could use. Also, because you mention BLAS, I presume you are using column-major order storage for your matrices, which is not so straightforward for row-wise operations. The really slow way to do it would be BLAS scal on each row with a pitch, but that would require one BLAS call per row, and the pitched memory access would kill performance because of its effect on coalescing and L1 cache coherency. My suggestion would be to use your own kernel for the second operation. It doesn't have to be all that complex, perhaps only something like this:
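A minimal sketch of such a kernel (my own reconstruction, not the answerer's original code, assuming a column-major M x N matrix d_A and a device vector d_rsums holding the gemv row sums) could be:

    // Hedged reconstruction. Each thread owns one row (grid-stride so
    // any M works) and steps through the columns, scaling as it goes.
    // In column-major storage a warp's threads hold consecutive rows,
    // so each column access is coalesced.
    __global__ void scale_rows(float *d_A, const float *d_rsums,
                               const int M, const int N)
    {
        for (int row = blockIdx.x * blockDim.x + threadIdx.x;
             row < M;
             row += gridDim.x * blockDim.x) {
            const float scale = 1.0f / d_rsums[row];  // alpha_i = 1/rowsum_i
            for (int col = 0; col < N; ++col)
                d_A[row + col * M] *= scale;
        }
    }

    // Launch sketch: scale_rows<<<(M + 255) / 256, 256>>>(d_A, d_rsums, M, N);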
That just has a bunch of blocks stepping through the rows columnwise, scaling as they go along. It should work for any size of column-major ordered matrix. Memory access should be coalesced, but depending on how worried you are about performance, there are a number of optimizations you could try. It at least gives a general idea of what to do.
I want to update the answers above with an example considering the use of CUDA Thrust's thrust::transform and of cuBLAS's cublas<t>dgmm. I'm skipping the calculation of the scaling factors alpha, since this has already been dealt with at Reduce matrix rows with CUDA and Reduce matrix columns with CUDA.
Below is a complete example:
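A minimal sketch along these lines (my own reconstruction, not the original code, assuming a column-major matrix and precomputed alpha factors) shows the thrust::transform approach, using a permutation iterator to broadcast each row's alpha along its row; the cublas<t>dgmm variant would follow the pattern shown in the first answer.

    // Hedged sketch, not the answer's original code: scale the rows of
    // a column-major Nrows x Ncols matrix by precomputed alphas.
    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/functional.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/iterator/transform_iterator.h>
    #include <thrust/iterator/permutation_iterator.h>
    #include <cstdio>

    // Maps a linear (column-major) element index to its row index.
    struct linear_index_to_row_index
    {
        int Nrows;
        linear_index_to_row_index(int Nrows_) : Nrows(Nrows_) {}
        __host__ __device__ int operator()(int i) const { return i % Nrows; }
    };

    int main()
    {
        const int Nrows = 4, Ncols = 5;

        // Fill with ones, so each row sum is Ncols and alpha = 1/Ncols.
        thrust::device_vector<float> d_A(Nrows * Ncols, 1.f);
        thrust::device_vector<float> d_alpha(Nrows, 1.f / Ncols);

        // d_A[i] *= alpha[row(i)]: the permutation iterator broadcasts
        // each row's alpha across all elements of that row.
        thrust::transform(
            d_A.begin(), d_A.end(),
            thrust::make_permutation_iterator(
                d_alpha.begin(),
                thrust::make_transform_iterator(thrust::make_counting_iterator(0),
                                                linear_index_to_row_index(Nrows))),
            d_A.begin(),
            thrust::multiplies<float>());

        for (int i = 0; i < Nrows; ++i) {    // every row should now sum to 1
            for (int j = 0; j < Ncols; ++j)
                printf("%8.4f ", (float)d_A[i + j * Nrows]);
            printf("\n");
        }
        return 0;
    }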
The Utilities.cu and Utilities.cuh files are maintained here and omitted. The TimingGPU.cu and TimingGPU.cuh files are likewise maintained here and omitted as well. I have tested the above code on a Kepler K20c and these are the results:
In the cuBLAS timing, I'm excluding the cublasCreate time. Even with this, the CUDA Thrust version seems to be more convenient.