MPI overhead in a shared-memory setup

Posted 2024-08-07 19:15:16


I want to parallelize a program. It's not that difficult with threads working on one big data structure in shared memory. But I want to be able to distribute it over a cluster, and I have to choose a technology to do that. MPI is one idea.

The question is: what overhead will MPI (or another technology) have if I skip implementing a specialized shared-memory version and let MPI handle all cases?

Update:

I want to grow a large data structure (a game tree) simultaneously on many computers. Most of it will live on only one cluster node, but some of it (the irregular top of the tree) will be shared and synchronized from time to time.

On a shared-memory machine I would like this to be achieved through shared memory. Can this be done generically?
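One way the "irregular top" could sit in genuine shared memory on each node, while ordinary MPI messages keep it synchronized across nodes, is the MPI-3 shared-memory window mechanism. The sketch below is only an illustration of that idea, not a worked solution: it assumes an MPI-3 implementation, and the names `node_comm`, `top`, and the 1 MiB size are hypothetical placeholders for the real tree layout.

```c
/* Sketch only: on-node ranks map one shared region for the top of the tree;
 * cross-node synchronization would still use normal MPI messages. */
#include <mpi.h>
#include <stddef.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    /* Group the ranks that share a physical node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Rank 0 on each node allocates the shared region; the others attach. */
    MPI_Aint top_bytes = (node_rank == 0) ? (1 << 20) : 0;  /* hypothetical size */
    void *top;                                               /* shared top of the tree */
    MPI_Win win;
    MPI_Win_allocate_shared(top_bytes, 1, MPI_INFO_NULL, node_comm, &top, &win);

    /* Non-zero ranks query where rank 0's segment is mapped in their address space. */
    if (node_rank != 0) {
        MPI_Aint size;
        int disp_unit;
        MPI_Win_shared_query(win, 0, &size, &disp_unit, &top);
    }

    /* ... grow local subtrees, update the shared top through `top`,
     * and exchange the top across nodes with normal sends/receives ... */

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```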


Comments (2)

瘫痪情歌 2024-08-14 19:15:16


All the popular MPI implementations will communicate locally via shared memory. The performance is very good as long as you don't spend all your time packing and unpacking buffers (i.e. your design is reasonable). In fact, the design MPI imposes on you can perform better than most threaded implementations, because the separate address spaces improve cache coherence. To consistently beat MPI, a threaded implementation has to be aware of the cache hierarchy and of what the other cores are working on.

With good network hardware (like InfiniBand), the HCA is responsible for getting your buffers on and off the network, so the CPU can do other things. Also, since many jobs are limited by memory bandwidth, they will perform better using, e.g., one core on each socket across multiple nodes than using multiple cores per socket.
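A minimal sketch of what "let MPI handle all cases" means in practice: the point-to-point code below is identical whether ranks 0 and 1 share a node or sit on different nodes; the buffer size is arbitrary.

```c
/* Sketch only: the same MPI_Send/MPI_Recv pair works for both placements;
 * popular MPI implementations route the on-node case through shared memory. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[1024] = {0};
    if (rank == 0) {
        MPI_Send(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d doubles\n", 1024);
    }

    MPI_Finalize();
    return 0;
}
```

With Open MPI, for example, launching the same binary with `mpirun -np 2` on one node or with `mpirun -np 2 --map-by node` across two nodes exercises the shared-memory path and the network path respectively, without changing the program.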

找个人就嫁了吧 2024-08-14 19:15:16


It depends on the algorithm. Clearly, communication between cluster nodes is orders of magnitude slower than shared memory, whether that is inter-process communication or multiple threads within a process. Therefore you want to minimize inter-node traffic, e.g. by duplicating data where possible and practicable, or by breaking the problem down in a way that minimizes inter-node communication.

For 'embarrassingly' parallel algorithms with little inter-node communication it's an easy choice. These are problems like brute-force searching for an encryption key, where each node can crunch numbers for long periods and report back to a central node periodically, but no communication is required to test keys.
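As a hedged illustration of that pattern, the sketch below gives each rank a disjoint slice of the key space and only reduces a progress counter to rank 0 every few rounds; `search_chunk`, the chunk size, and the round count are hypothetical stand-ins for real work.

```c
/* Sketch only: embarrassingly parallel search with periodic, cheap reporting;
 * no communication is needed to test the keys themselves. */
#include <mpi.h>
#include <stdio.h>

/* Hypothetical worker: test `count` keys starting at `start`, return hits. */
static long search_chunk(unsigned long start, unsigned long count) {
    (void)start; (void)count;
    return 0;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const unsigned long chunk = 1UL << 20;   /* keys per rank, per round */
    long local_hits = 0;

    for (int round = 0; round < 100; ++round) {
        /* Disjoint slice of the key space for this rank and round. */
        unsigned long start = ((unsigned long)round * (unsigned long)size + (unsigned long)rank) * chunk;
        local_hits += search_chunk(start, chunk);

        /* Periodic report back to the central node. */
        if (round % 10 == 9) {
            long total_hits = 0;
            MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0,
                       MPI_COMM_WORLD);
            if (rank == 0)
                printf("after round %d: %ld candidate keys found\n",
                       round + 1, total_hits);
        }
    }

    MPI_Finalize();
    return 0;
}
```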
