A general question about solving parallelization problems
I have a general question about programming parallel algorithms in C. Let's assume our task is to implement some matrix algorithms with MPI and/or OpenMP. There are situations, like false sharing in OpenMP, or in MPI where the amount of communication depends on the matrix dimensions (columns cyclically distributed among processes), which cause problems. Would it be a good and common approach to solve these situations by, for example, transposing the matrix, since this would reduce the necessary communication or even avoid the false sharing problem? Afterwards you would undo the transposition. Of course, this assumes that it leads to a much better speedup.
I don't think this would be very cunning; it feels more like a lazy way to do it. But I'm curious to read some opinions about it.
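To make the false-sharing scenario concrete, here is a minimal OpenMP sketch (my own illustration, not code from the question): each thread repeatedly writes to its own slot of a shared array, but adjacent slots live on the same cache line, so the line bounces between cores. Padding each slot out to a cache line avoids it.

```c
/* Minimal false-sharing illustration (not from the question itself).
 * Each thread accumulates into partial[tid]. Without the pad member,
 * adjacent doubles share a 64-byte cache line and the repeated writes
 * invalidate each other's caches; remove pad to see the slowdown. */
#include <omp.h>
#include <stdio.h>

#define N 10000000
#define MAX_THREADS 64

struct padded { double val; char pad[64 - sizeof(double)]; };

int main(void) {
    static struct padded partial[MAX_THREADS]; /* one cache line per thread */
    int nthreads = 1;

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp single
        nthreads = omp_get_num_threads();
        partial[tid].val = 0.0;
        #pragma omp for
        for (int i = 0; i < N; i++)
            partial[tid].val += 1.0 / (1.0 + i); /* repeated private writes */
    }

    double sum = 0.0;
    for (int t = 0; t < nthreads; t++)
        sum += partial[t].val;
    printf("sum = %f\n", sum);
    return 0;
}
```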
3 Answers
Let's start with the first question: can it make sense to transpose? The answer is, it depends, and you can estimate whether it will improve things or not.
The transposition/retransposition will impose a one-time memory-bandwidth cost of 2*(going through memory the fast way) + 2*(going through memory the slow way), where those memory operations are literal memory operations in the multicore case, or network communications in the distributed-memory case. You're going to be reading the matrix the fast way and putting it into memory the slow way. (You can make this, essentially, 4*(going through memory the fast way) by reading the matrix one cache-sized block at a time, transposing in cache, and writing out in order.)
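For reference, here is a sketch of that cache-blocked transpose; the tile size BS is my assumption, and should be tuned so a BS x BS tile of doubles fits comfortably in cache:

```c
/* Cache-blocked out-of-place transpose of a square n x n matrix:
 * read A in row order one BS x BS tile at a time, transpose the tile
 * in cache, and write it out. BS is a tunable parameter. */
#include <stddef.h>

#define BS 64

void transpose_blocked(const double *A, double *B, size_t n)
{
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t jj = 0; jj < n; jj += BS) {
            size_t imax = (ii + BS < n) ? ii + BS : n;
            size_t jmax = (jj + BS < n) ? jj + BS : n;
            for (size_t i = ii; i < imax; i++)
                for (size_t j = jj; j < jmax; j++)
                    B[j * n + i] = A[i * n + j];   /* B = A^T */
        }
}
```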
Whether that's a win depends on how many times you'll be accessing the array. If you would have been hitting the entire non-transposed array 4 times with memory accesses in the "wrong" direction, then you will clearly win by doing the two transposes. If you'd only be going through the non-transposed array once in the wrong direction, then you almost certainly won't win by doing the transposition.
As to the larger question, @AlexandreC is absolutely right here -- trying to implement your own linear algebra routines is madness. Take a look at, e.g., How To Write Fast Numerical Code, figure 3; there can be a factor of 40 in performance between naive and highly tuned (say) GEMM operations. These things are highly memory-bandwidth limited, and in parallel that means network limited. By far the best approach is to use existing tools (a sketch of such a call follows the list below).
For multicore linear algebra, existing libraries include ATLAS, PLASMA, and FLAME.
For MPI implementations, there are BLACS and ScaLAPACK,
or complete solver environments like PETSc and Trilinos.
I don't know that you'd throw the transpose away the second you completed the operation, but yes, this is a valid mechanism to increase parallelism.
I am not an expert; I've only read a little bit about this topic, and even that was for SIMD architectures, so please take my opinion lightly... but I think the usual mechanism is to lay your structures out in memory to match the machine (so you'd transpose a large matrix to line up better with your vector units and to increase the dependency distance in your loops), and then also build an indexing structure of pointers around that so you can quickly access individual elements in the transpose differently. This gets harder to do as your input changes more dynamically.
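As a sketch of one way to read that "indexing structure of pointers" idea (my own interpretation and naming, not code from the answer): store the matrix transposed for vector-friendly access, and keep an array of row pointers so a logical element (i, j) is still a quick two-step lookup:

```c
/* Sketch (my own reading of the answer above): the data is stored
 * transposed, and rows[j] points at the start of logical column j,
 * so element (i, j) of the logical n x m matrix is rows[j][i]. */
#include <stdlib.h>

typedef struct {
    double *data;    /* transposed storage: m columns of n doubles */
    double **rows;   /* rows[j] points at logical column j */
    size_t n, m;     /* logical matrix is n x m */
} tmatrix;

int tmatrix_init(tmatrix *t, size_t n, size_t m)
{
    t->data = malloc(n * m * sizeof *t->data);
    t->rows = malloc(m * sizeof *t->rows);
    if (!t->data || !t->rows) { free(t->data); free(t->rows); return -1; }
    for (size_t j = 0; j < m; j++)
        t->rows[j] = t->data + j * n;   /* each column is contiguous */
    t->n = n; t->m = m;
    return 0;
}

/* Logical element (i, j) of the original, untransposed matrix: */
static inline double tmatrix_get(const tmatrix *t, size_t i, size_t j)
{
    return t->rows[j][i];
}
```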
Lazy solutions are usually better than "cunning" ones, because they tend to be simpler and more straightforward. They're therefore easier to implement, document, understand, and maintain. Indeed, laziness is arguably one of the greatest virtues a programmer can have. As long as the program produces correct results at acceptable speeds, nobody should care how elegantly you solved the problem (including you).