MPI and OpenMP. Do I have other options?
I have a linear algebra code that I am trying to get to run faster. It's an iterative algorithm with a loop and matrix-vector multiplications within it.
So far, I have used MATMUL (Fortran Lib.), DGEMV, and tried writing my own MV code in OpenMP, but the algorithm is doing no better in terms of scalability. Speedups are barely 3.5 - 4x irrespective of how many processors I allot to it (I have tried up to 64 processors).
The profiling shows significant time being spent in the matrix-vector product; the rest is fairly nominal.
My question is:
I have a shared memory system with tons of RAM and processors. I have tried tweaking the OpenMP implementation of the code (including the matrix-vector product), but it has not helped. Will it help to code in MPI? I am not a pro at MPI, but the ability to fine-tune the message communication might help a bit, though I can't be sure. Any comments?
More generally, from the literature I have read, MPI = distributed, OpenMP = shared, but can they perform well in each other's territory? Like MPI on shared memory? Will it work? Will it be better than the OpenMP implementation if done well?
2 Answers
You're best off just using a linear algebra package that is already well optimized for a multicore environment and using that for your matrix-vector multiplication. The Atlas package, GotoBLAS (if you have a Nehalem or older; sadly it's no longer being updated), or vendor BLAS implementations (like MKL for Intel CPUs, ACML for AMD, or vecLib for Apple, which all cost money) all have good, well-tuned, multithreaded BLAS implementations. Unless you have excellent reason to believe that you can do better than those full-time development teams, you're best off using them.
Note that you'll never get the parallel speedup with DGEMV that you do with DGEMM, simply because the vector is much smaller than another matrix would be and so there's less work; but you can still do quite well, and you'll find you get much better performance with these libraries than with anything hand-rolled, unless you were already doing multi-level cache blocking.
You can use MPI in a shared environment (though not OpenMP in a distributed one). However, achieving a good speedup depends a lot more on your algorithms and data dependencies than the technology used. Since you have a lot of shared memory, I'd recommend you stick with OpenMP, and carefully examine whether you're making the best use of your resources.