Grouping MPI tasks by host
I want to easily perform collective communications independently on each machine of my cluster. Let's say I have 4 machines with 8 cores each; my MPI program would run 32 MPI tasks. What I would like is, for a given function:
- on each host, only one task performs a computation, while the other tasks do nothing during this computation. In my example, 4 MPI tasks would do the computation and the other 28 would be waiting.
- once the computation is done, each MPI task would perform a collective communication ONLY with local tasks (tasks running on the same host).

Conceptually, I understand I must create one communicator for each host. I searched around and found nothing that explicitly does that. I am not really comfortable with MPI groups and communicators. Here are my two questions:
- is MPI_Get_processor_name unique enough for such behaviour?
- more generally, do you have a piece of code doing that?
4 Answers
The specification says that MPI_Get_processor_name returns "A unique specifier for the actual (as opposed to virtual) node", so I think you'd be OK with that. I guess you'd do a gather to assemble all the host names and then assign groups of processors to go off and make their communicators; or dup MPI_COMM_WORLD, turn the names into integer hashes, and use MPI_Comm_split to partition the set (sketched below).

You could also take the approach janneb suggests and use implementation-specific options to mpirun to ensure that the MPI implementation assigns tasks that way; OpenMPI uses --byslot to generate this ordering; with mpich2 you can use -print-rank-map to see the mapping.

But is this really what you want to do? If the other processes are sitting idle while one processor is working, how is this better than everyone redundantly doing the calculation? (Or is the calculation very memory- or I/O-intensive, and you're worried about contention?) If you're going to be doing a lot of this -- treating on-node parallelization very differently from off-node parallelization -- then you may want to think about hybrid programming models: running one MPI task per node and MPI_spawning subtasks, or using OpenMP for on-node communications, both as suggested by HPM.
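Here is a minimal sketch of the hash-and-split idea, assuming hostnames differ between nodes; hostname_color is a hypothetical helper, and note that two different hostnames could in principle hash to the same color:

```c
#include <mpi.h>
#include <stdio.h>

/* Hypothetical helper: fold a hostname into a non-negative color
 * for MPI_Comm_split (djb2 string hash). */
static int hostname_color(const char *name)
{
    unsigned int h = 5381u;
    for (; *name != '\0'; ++name)
        h = h * 33u + (unsigned char)*name;
    return (int)(h % 0x7fffffffu);   /* colors must be >= 0 */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    char name[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(name, &len);

    /* Ranks with the same hostname hash end up in the same communicator. */
    MPI_Comm node_comm;
    MPI_Comm_split(MPI_COMM_WORLD, hostname_color(name), world_rank, &node_comm);

    int node_rank, node_size;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &node_size);
    printf("world rank %d is node rank %d of %d on %s\n",
           world_rank, node_rank, node_size, name);

    /* e.g. node rank 0 computes, then broadcasts only to local tasks:
     * MPI_Bcast(buf, count, MPI_DOUBLE, 0, node_comm); */

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```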
I don't think (educated thought, not definitive) that you'll be able to do what you want entirely from within your MPI program.
The response of the system to a call to MPI_Get_processor_name is system-dependent; on your system it might return node00, node01, node02, node03 as appropriate, or it might return my_big_computer for whatever processor you are actually running on. The former is more likely, but it is not guaranteed.

One strategy would be to start 32 processes and, if you can determine which node each is running on, partition your communicator into 4 groups, one per node. This way you can manage inter- and intra-node communications yourself as you wish.
Another strategy would be to start 4 processes and pin them to different nodes. How you pin processes to nodes (or processors) will depend on your MPI runtime and any job management system you might have, such as Grid Engine. This will probably involve setting environment variables -- but you don't tell us anything about your run-time system, so we can't guess what they might be. You could then have each of the 4 processes dynamically spawn a further 7 (or 8) processes and pin those to the same node as the initial process. To do this, read up on the topic of intercommunicators and your run-time system's documentation; a sketch of the spawning step follows below.
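A hedged sketch of that spawn step, assuming the runtime places spawned children on the same node as their parent (this is runtime-dependent and may require an MPI_Info key); the program re-launches its own binary as the children:

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Comm parent;
    MPI_Comm_get_parent(&parent);

    if (parent == MPI_COMM_NULL) {
        /* One of the 4 initial (pinned) processes: spawn 7 local children
         * by re-launching this same binary. */
        MPI_Comm children;
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 7, MPI_INFO_NULL,
                       0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);

        /* Merge the intercommunicator into a node-local intracommunicator. */
        MPI_Comm node_comm;
        MPI_Intercomm_merge(children, 0, &node_comm);
        /* ... compute here, then e.g. MPI_Bcast over node_comm ... */
        MPI_Comm_free(&node_comm);
        MPI_Comm_free(&children);
    } else {
        /* A spawned child: join the node-local communicator and wait. */
        MPI_Comm node_comm;
        MPI_Intercomm_merge(parent, 1, &node_comm);
        /* ... take part in the node-local collective ... */
        MPI_Comm_free(&node_comm);
    }

    MPI_Finalize();
    return 0;
}
```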
A third strategy, now it's getting a little crazy, would be to start 4 separate MPI programs (8 processes each), one on each node of your cluster, and to join them as they execute. Read about MPI_Comm_connect and MPI_Open_port for details; a sketch of the port mechanics follows below.

Finally, for extra fun, you might consider hybridising your program, running one MPI process on each node and having each of those processes execute an OpenMP shared-memory (sub-)program.
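A minimal sketch of those port mechanics; how the port name reaches the connecting program (here, a shared file named port.txt, an assumption) is up to you, and MPI also offers MPI_Publish_name for that purpose:

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    if (argc > 1 && strcmp(argv[1], "server") == 0) {
        MPI_Open_port(MPI_INFO_NULL, port);   /* runtime picks a port name */
        FILE *f = fopen("port.txt", "w");     /* share it out of band */
        fprintf(f, "%s\n", port);
        fclose(f);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    } else {
        FILE *f = fopen("port.txt", "r");
        fgets(port, MPI_MAX_PORT_NAME, f);
        fclose(f);
        port[strcspn(port, "\n")] = '\0';
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &inter);
    }

    /* 'inter' is an intercommunicator between the two programs; merge it
     * with MPI_Intercomm_merge if you want ordinary collectives. */
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}
```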
Typically your MPI runtime environment can be controlled, e.g. by environment variables, in how tasks are distributed over nodes. The default tends to be sequential allocation, that is, for your example with 32 tasks distributed over 4 8-core machines, you'd have ranks 0-7 on the first machine, ranks 8-15 on the second, ranks 16-23 on the third, and ranks 24-31 on the fourth.
And yes, MPI_Get_processor_name should get you the hostname so you can figure out where the boundaries between hosts are.
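Under that assumption (sequential/block allocation, which is not guaranteed), a node communicator can be derived from the world rank alone; RANKS_PER_NODE here reflects the 8-core machines in the question:

```c
#include <mpi.h>

#define RANKS_PER_NODE 8   /* assumption: 8 cores per machine, block allocation */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* With sequential allocation, ranks 0-7 share node 0, 8-15 node 1, ... */
    MPI_Comm node_comm;
    MPI_Comm_split(MPI_COMM_WORLD, world_rank / RANKS_PER_NODE,
                   world_rank, &node_comm);

    /* ... node-local collectives over node_comm ... */

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```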
The modern MPI 3 answer to this is to call MPI_Comm_split_type with the split type MPI_COMM_TYPE_SHARED, which partitions a communicator into subcommunicators whose ranks can share memory, i.e. one per node.
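For example, a short self-contained sketch of the call:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* MPI_COMM_TYPE_SHARED groups ranks that can share memory,
     * i.e. ranks running on the same host. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank;
    MPI_Comm_rank(node_comm, &node_rank);

    /* Node rank 0 on each host computes; everyone then takes part
     * in a node-local collective, matching the question's pattern. */
    double result = 0.0;
    if (node_rank == 0)
        result = 42.0;                    /* placeholder computation */
    MPI_Bcast(&result, 1, MPI_DOUBLE, 0, node_comm);

    printf("world rank %d got %f\n", world_rank, result);

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```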