Efficient decomposition of a matrix into square sub-matrices in C++
I have implemented a Matrix datatype in C++ by using a 1D datatype and wrapping it into rows and columns. Now I want to be able to create square/blocked sub-matrices from it, and I want to do this in memory.
The problem is that I want some of these sub-matrices to be transferable to GPU memory so they can be processed there in parallel. This is useful, for example, for matrix multiplication. Since these sub-matrices are not contiguous in main memory, copying them to device memory as a single unit looks impossible without creating a separate copy. I would like this GPU sub-matrix copy to map directly back to the original CPU matrix, for updates and for efficiency. I do not know the exact partitioning in advance.
Does anyone have an idea of how I could achieve this?
Just a reminder: the matrix needs to be partitioned into blocks, not row-wise, which would be relatively easy in C/C++.
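For concreteness, here is a simplified sketch of the layout I described (the names are illustrative, not my actual code): a row-major matrix backed by a 1D buffer, and a block "view" that only stores offsets into its parent, which is why a block is not one contiguous range.

#include <cstddef>
#include <vector>

struct Matrix {
    std::size_t rows, cols;
    std::vector<float> data;                               // 1D storage, row-major

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}
    float& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
};

// A b x b block starting at (r0, c0). Its rows are strided by parent->cols
// elements, so the block is NOT contiguous in host memory and cannot be
// copied to the device as a single unit without a staging copy.
struct BlockView {
    Matrix* parent;
    std::size_t r0, c0, b;
    float& at(std::size_t i, std::size_t j) { return parent->at(r0 + i, c0 + j); }
};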
2 Answers
If the required sub-matrices are known at the time the 'master' matrix is created, and if they form a partition of the master, it's possible to create a composite matrix class somewhat like this:
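(The code shown in the original answer was not preserved here; the following is a minimal sketch of the idea, assuming 'PlainMatix' is a contiguous, row-major block type. CompositeMatrix and copyBlockToDevice are illustrative names.)

#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// One contiguous, row-major block of the master matrix.
struct PlainMatix {
    std::size_t rows, cols;
    std::vector<float> data;
    PlainMatix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}
};

// The master matrix, stored as a grid of blocks rather than one big array.
struct CompositeMatrix {
    std::size_t blockRows, blockCols;        // number of blocks per dimension
    std::vector<PlainMatix> blocks;          // the partition of the master matrix

    PlainMatix& block(std::size_t bi, std::size_t bj) {
        return blocks[bi * blockCols + bj];
    }

    // Each block is contiguous, so it moves to device memory in one call,
    // and the same host pointer can receive the results when copying back.
    void copyBlockToDevice(std::size_t bi, std::size_t bj, float* d_block) {
        PlainMatix& b = block(bi, bj);
        cudaMemcpy(d_block, b.data.data(), b.data.size() * sizeof(float),
                   cudaMemcpyHostToDevice);
    }
};

Element (i, j) of the master matrix is then addressed by first locating the owning block and then indexing inside it.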
The 'PlainMatix' can be organized in a memory-efficient way.
If your matrices' dimensions are powers of 2, you can store them in host memory in z-order. This way, you just need the start and end index of a sub-matrix to copy it with one call to cudaMemcpy.
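As an illustration of the z-order idea (not code from the original answer; mortonIndex and copyBlockToDevice are just example names), a b x b block whose origin is aligned to b, with b a power of two, occupies one contiguous index range:

#include <cstddef>
#include <cstdint>
#include <vector>
#include <cuda_runtime.h>

// Interleave the bits of row and column to get the Morton (z-order) index.
static std::uint64_t mortonIndex(std::uint32_t row, std::uint32_t col) {
    std::uint64_t idx = 0;
    for (int bit = 0; bit < 32; ++bit) {
        idx |= (std::uint64_t)((col >> bit) & 1u) << (2 * bit);
        idx |= (std::uint64_t)((row >> bit) & 1u) << (2 * bit + 1);
    }
    return idx;
}

// Copy the b x b block with top-left corner (r0, c0), where r0 and c0 are
// multiples of b and b is a power of two. In z-order this block occupies
// b*b consecutive elements, so a single cudaMemcpy suffices.
void copyBlockToDevice(const std::vector<float>& zData,
                       std::uint32_t r0, std::uint32_t c0, std::uint32_t b,
                       float* d_block) {
    std::uint64_t start = mortonIndex(r0, c0);    // first element of the block
    std::uint64_t count = (std::uint64_t)b * b;   // block size in elements
    cudaMemcpy(d_block, zData.data() + start, count * sizeof(float),
               cudaMemcpyHostToDevice);
}

Note that within the copied range the elements are themselves laid out in z-order, so the GPU kernel must use the same indexing (or a matching recursively blocked layout).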