Speeding up matrix computations in C++ with parallel processing?
I'm trying to compute the following:
Y = Y0 - ( Un.(A*Y0) + Vn.(Y0*Z) )*dt
in the fastest/most efficient manner possible, where Y0, Un, Vn, A, and Z are matrices on the order of 300 x 300, "." is the element-wise (Hadamard) matrix product, and "*" represents matrix multiplication.
My questions are:
- Is computing the computationally independent sub-matrices A2 = A*Y0 and Z2 = Y0*Z, and then Un2 = Un.*A2 and Vn2 = Vn.*Z2, in parallel faster than computing them serially, so that Y = Y0 - (Un2 + Vn2)*dt? If so, what is a good example of how this parallel computation would be done?
- Is there some other better/recommended approach (e.g., using ATLAS)?
The language is C++, and this is to be run on a Linux or Windows platform with multi-core (at least dual-core) processors. I'm currently using Boost uBLAS as the BLAS package.
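For reference, a minimal sketch of the serial computation in uBLAS (using its prod and element_prod functions for the matrix and element-wise products; the function name is just for illustration):

```cpp
#include <boost/numeric/ublas/matrix.hpp>

namespace ublas = boost::numeric::ublas;

// Serial baseline: Y = Y0 - (Un .* (A*Y0) + Vn .* (Y0*Z)) * dt
ublas::matrix<double> update(const ublas::matrix<double>& Y0,
                             const ublas::matrix<double>& Un,
                             const ublas::matrix<double>& Vn,
                             const ublas::matrix<double>& A,
                             const ublas::matrix<double>& Z,
                             double dt)
{
    ublas::matrix<double> A2 = ublas::prod(A, Y0);   // matrix product A*Y0
    ublas::matrix<double> Z2 = ublas::prod(Y0, Z);   // matrix product Y0*Z
    return Y0 - (ublas::element_prod(Un, A2) + ublas::element_prod(Vn, Z2)) * dt;
}
```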
Comments (4)
OpenMP should be a quick and easy way of seeing if the parallel route would be faster.
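For the two independent products, a minimal sketch with OpenMP sections might look like this (assuming the uBLAS matrices from the question; the function name is illustrative, and you compile with -fopenmp on GCC or /openmp on MSVC):

```cpp
#include <boost/numeric/ublas/matrix.hpp>

namespace ublas = boost::numeric::ublas;

// A*Y0 and Y0*Z are independent, so OpenMP sections can run them
// on separate threads.
void products_in_parallel(const ublas::matrix<double>& A,
                          const ublas::matrix<double>& Y0,
                          const ublas::matrix<double>& Z,
                          ublas::matrix<double>& A2,
                          ublas::matrix<double>& Z2)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        A2 = ublas::prod(A, Y0);   // first independent product

        #pragma omp section
        Z2 = ublas::prod(Y0, Z);   // second independent product
    }
}
```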
I agree with @genpfault; in my experiments running several loops I'm using OpenMP, and it is very useful and easy to use! Here is a link to chryswoods' blog, OpenMP basics, which is one of the easiest tutorials I have seen.
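As an illustration, here is the kind of loop-level parallelism that tutorial covers, applied to the element-wise part of the update (a hedged sketch; the function name is made up):

```cpp
#include <cstddef>
#include <boost/numeric/ublas/matrix.hpp>

namespace ublas = boost::numeric::ublas;

// Element-wise part of the update, Y = Y0 - (Un.*A2 + Vn.*Z2)*dt,
// parallelised over rows. The index is signed because older OpenMP
// versions require a signed loop variable.
void elementwise_update(ublas::matrix<double>& Y,
                        const ublas::matrix<double>& Y0,
                        const ublas::matrix<double>& Un,
                        const ublas::matrix<double>& Vn,
                        const ublas::matrix<double>& A2,
                        const ublas::matrix<double>& Z2,
                        double dt)
{
    const std::ptrdiff_t rows = Y.size1();
    const std::ptrdiff_t cols = Y.size2();
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < rows; ++i)
        for (std::ptrdiff_t j = 0; j < cols; ++j)
            Y(i, j) = Y0(i, j) - (Un(i, j) * A2(i, j) + Vn(i, j) * Z2(i, j)) * dt;
}
```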
Your problem is very small. You should try using something like Eigen (or, as you mentioned, ATLAS). I prefer Eigen since it is fast to use.
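For instance, the whole update maps almost one-to-one onto a single Eigen expression (a sketch assuming dense double matrices):

```cpp
#include <Eigen/Dense>

// Eigen's expression templates fuse the element-wise work into one pass,
// and the two matrix products use its optimised GEMM kernel (which can
// also run multithreaded when Eigen is compiled with OpenMP).
Eigen::MatrixXd update(const Eigen::MatrixXd& Y0, const Eigen::MatrixXd& Un,
                       const Eigen::MatrixXd& Vn, const Eigen::MatrixXd& A,
                       const Eigen::MatrixXd& Z, double dt)
{
    return Y0 - (Un.cwiseProduct(A * Y0) + Vn.cwiseProduct(Y0 * Z)) * dt;
}
```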
I got ~3 GFLOPS when trying to multiply similar matrices with Boost uBLAS. Implementing a cache-aware matrix multiplication got me to ~12 GFLOPS. Parallelising the cache-aware multiplication with OpenMP got me to ~30 GFLOPS (4 cores, 2 threads/core).
So first of all, you should ensure that you are using a cache-aware matrix multiplication algorithm (or a cache-oblivious one if you want to get fancy). Then you can parallelise, but you want to make your parallelism as coarse-grained as possible, or else Amdahl's law will hit you.
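To illustrate what "cache-aware" means here, a minimal sketch of a tiled multiplication (the tile size is a placeholder you would tune for your cache):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Cache-blocked C += A*B for row-major n x n matrices in flat arrays.
// BS is chosen so that three BS x BS tiles fit comfortably in cache.
void gemm_blocked(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, std::size_t n, std::size_t BS = 64)
{
    for (std::size_t ii = 0; ii < n; ii += BS)
        for (std::size_t kk = 0; kk < n; kk += BS)
            for (std::size_t jj = 0; jj < n; jj += BS)
                // One tile: the i-k-j order keeps the inner loop streaming
                // over contiguous rows of B and C.
                for (std::size_t i = ii; i < std::min(ii + BS, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BS, n); ++k) {
                        const double aik = A[i * n + k];
                        for (std::size_t j = jj; j < std::min(jj + BS, n); ++j)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}
```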
A good rule of thumb is to pick a unit of work that takes at least 1s to execute, and parallelise that. Here a matrix multiplication only takes a few milliseconds, so I would definitely pick something bigger. E.g., instead of trying to parallelise a single calculation of Y, you could calculate several batches of them in parallel.
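A hedged sketch of that batch-level approach, assuming some serial update() implementing the full Y expression (like the uBLAS sketch under the question):

```cpp
#include <cstddef>
#include <vector>
#include <boost/numeric/ublas/matrix.hpp>

namespace ublas = boost::numeric::ublas;
using Matrix = ublas::matrix<double>;

// Any serial implementation of the full Y expression.
Matrix update(const Matrix& Y0, const Matrix& Un, const Matrix& Vn,
              const Matrix& A, const Matrix& Z, double dt);

// Coarse-grained parallelism: each thread computes a whole Y for a
// different Y0, so the unit of work per thread is large.
void update_batch(const std::vector<Matrix>& Y0s, std::vector<Matrix>& Ys,
                  const Matrix& Un, const Matrix& Vn,
                  const Matrix& A, const Matrix& Z, double dt)
{
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(Y0s.size());
    #pragma omp parallel for
    for (std::ptrdiff_t b = 0; b < n; ++b)
        Ys[b] = update(Y0s[b], Un, Vn, A, Z, dt);
}
```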