No, in my opinion it is unsuitable for most processing you would do on a multicore system. The overhead is too high, the objects you pass around must be deeply cloned, and passing large object graphs around only to run a very small computation is very inefficient. It is really meant for sharing data between separate processes, most often running in separate memory spaces, and most often running long computations.

A multicore processor is a shared-memory machine, so there are much more efficient ways to do parallel processing that do not involve copying objects and where most threads run for only a very short time. For example, think of a multithreaded Quicksort. With MPI, the overhead of allocating memory and copying the data to a thread before it can be partitioned means that, even with an unlimited number of processors, it will be much slower than Quicksort running on a single processor.

As an example, in Java I would use a BlockingQueue (a shared-memory construct) to pass object references between threads, with very little overhead.

Not that MPI does not have its place; see for example the Google search cluster, which uses message passing. But it's probably not the problem you are trying to solve.
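For illustration only, here is a rough C++ analog of that BlockingQueue idea: threads hand each other pointers to shared objects through a small queue instead of cloning whole object graphs. The BlockingQueue class and the work items below are hypothetical, a sketch rather than a reference implementation.

    #include <condition_variable>
    #include <iostream>
    #include <memory>
    #include <mutex>
    #include <queue>
    #include <string>
    #include <thread>

    // A tiny unbounded blocking queue: producers push, consumers block until
    // an item is available. Only pointers change hands; nothing is deep-cloned.
    template <typename T>
    class BlockingQueue {
    public:
        void push(T item) {
            {
                std::lock_guard<std::mutex> lock(mutex_);
                queue_.push(std::move(item));
            }
            cv_.notify_one();
        }
        T pop() {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return !queue_.empty(); });
            T item = std::move(queue_.front());
            queue_.pop();
            return item;
        }
    private:
        std::queue<T> queue_;
        std::mutex mutex_;
        std::condition_variable cv_;
    };

    int main() {
        BlockingQueue<std::shared_ptr<std::string>> queue;

        // Producer: publishes references to work items, not copies of them.
        std::thread producer([&queue] {
            for (int i = 0; i < 5; ++i)
                queue.push(std::make_shared<std::string>("work item " + std::to_string(i)));
            queue.push(nullptr);  // sentinel: no more work
        });

        // Consumer: receives the very same objects through shared memory.
        std::thread consumer([&queue] {
            while (auto item = queue.pop())
                std::cout << *item << '\n';
        });

        producer.join();
        consumer.join();
        return 0;
    }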
I would have to agree with tgamblin. You'll probably have to roll your sleeves up and really dig into the code to use MPI, explicitly handling the organization of the message-passing yourself. If this is the sort of thing you like or don't mind doing, I would expect that MPI would work just as well on multicore machines as it would on a distributed cluster.
Speaking from personal experience... In graduate school I wrote some C code to do large-scale simulation of electrophysiological models on a cluster where each node was itself a multicore machine. Therefore, there were a couple of different parallel methods I considered for tackling the problem.
1) I could use MPI alone, treating every processor as its own "node" even though some of them are grouped together on the same machine.
2) I could use MPI to handle data moving between multicore nodes, and then use threading (POSIX threads) within each multicore machine, where processors share memory.
For the specific mathematical problem I was working on, I tested the two formulations first on a single multicore machine: one using MPI and one using POSIX threads. As it turned out, the MPI implementation was much more efficient, giving a speed-up of close to 2 on a dual-core machine as opposed to 1.3-1.4 for the threaded implementation. For the MPI code, I was able to organize operations so that processors were rarely idle: they stayed busy while messages were in flight between them, which masked much of the delay of transferring data. With the threaded code, I ended up with a lot of mutex bottlenecks that often forced threads to sit and wait while other threads completed their computations. Keeping the computational load balanced between threads didn't seem to help.
This may have been specific to just the models I was working on, and the effectiveness of threading vs. MPI would likely vary greatly for other types of parallel problems. Nevertheless, I would disagree that MPI has an unwieldy overhead.
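To make the "keep processors busy while messages are in flight" point above concrete, here is a generic sketch of the non-blocking pattern: post the sends and receives, compute on the data that does not depend on them, then wait and finish the boundary. This is not the poster's actual electrophysiology code; the array, neighbor, and size names are made up.

    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 1 << 20;                      // size of this rank's slice (made up)
        std::vector<double> local(n, 1.0 * rank);
        double halo_left = 0.0, halo_right = 0.0;   // boundary values from the neighbors
        int left  = (rank - 1 + size) % size;
        int right = (rank + 1) % size;
        MPI_Request reqs[4];

        // 1) Post the communication first (non-blocking)...
        MPI_Irecv(&halo_left,  1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&halo_right, 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(&local.front(), 1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(&local.back(),  1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);

        // 2) ...and stay busy on interior work that needs no remote data
        //    while the messages are in flight.
        double interior = 0.0;
        for (int i = 1; i < n - 1; ++i) interior += local[i];

        // 3) Only now block on the messages and finish the boundary points.
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
        double total = interior + halo_left + halo_right;
        (void)total;

        MPI_Finalize();
        return 0;
    }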
MPI is not inefficient. You need to break the problem down into chunks, pass the chunks around, and reassemble the results as each chunk is finished. No one in their right mind would pass around the whole object via MPI when only a portion of the problem is being worked on per thread. It's not an inefficiency of the interface or the design pattern; it's an inefficiency in the programmer's knowledge of how to break up a problem.
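As a minimal sketch of that chunking idea (using a simple array reduction as a stand-in for the "problem"; the names here are illustrative, not from the original post), each rank receives only its own slice and only the small per-chunk result travels back:

    #include <mpi.h>
    #include <numeric>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int chunk = 4;                 // elements handled per rank
        std::vector<double> all;             // only the root holds the full problem
        if (rank == 0) {
            all.resize(chunk * size);
            std::iota(all.begin(), all.end(), 0.0);
        }

        // Each rank receives just its own chunk...
        std::vector<double> mine(chunk);
        MPI_Scatter(all.data(), chunk, MPI_DOUBLE,
                    mine.data(), chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        // ...works on it locally...
        double partial = std::accumulate(mine.begin(), mine.end(), 0.0);

        // ...and only the small per-chunk result is sent back and combined.
        double total = 0.0;
        MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }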
When you use a locking mechanism, the overhead of the mutex does not scale well. This is because the underlying run queue does not know when you are going to lock the thread next. You will end up with more kernel-level thrashing using mutexes than with a message-passing design pattern.
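A toy illustration of the contention point, not a rigorous benchmark: threads that all take the same lock largely serialize on it, while private partial results combined once at the end behave much more like a message-passing/reduce pattern. Everything in this sketch is made up for illustration.

    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    int main() {
        const int  threads = 4;
        const long iters   = 1000000;

        // Contended version: every increment fights for the same mutex.
        long shared_total = 0;
        std::mutex m;
        {
            std::vector<std::thread> pool;
            for (int t = 0; t < threads; ++t)
                pool.emplace_back([&] {
                    for (long i = 0; i < iters; ++i) {
                        std::lock_guard<std::mutex> lock(m);
                        ++shared_total;
                    }
                });
            for (auto& th : pool) th.join();
        }

        // Low-contention version: each thread accumulates privately and the
        // partial results are combined once at the end.
        std::vector<long> partial(threads, 0);
        {
            std::vector<std::thread> pool;
            for (int t = 0; t < threads; ++t)
                pool.emplace_back([&, t] {
                    for (long i = 0; i < iters; ++i) ++partial[t];
                });
            for (auto& th : pool) th.join();
        }
        long combined = 0;
        for (long p : partial) combined += p;

        std::cout << shared_total << " vs " << combined << '\n';
        return 0;
    }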
MPI has a very large amount of overhead, primarily to handle inter-process communication and heterogeneous systems. I've used it in cases where a small amount of data is being passed around, and where the ratio of computation to data is large. This is not the typical usage scenario for most consumer or business tasks, and in any case, as a previous reply mentioned, on a shared memory architecture like a multicore machine, there are vastly faster ways to handle it, such as memory pointers.
If you had a problem with the properties described above, and you wanted to be able to spread the job across other machines, which would have to be on the same high-speed network as you, then maybe MPI could make sense. I have a hard time imagining such a scenario, though.
You have to decide if you want low-level threading or high-level threading. If you want low level, then use Pthreads. You have to be careful that you don't introduce race conditions and make threading performance work against you.
I have used some OSS packages (for C and C++) that are scalable and optimize the task scheduling. TBB (Threading Building Blocks) and Cilk Plus are good, easy to code with, and get applications off the ground quickly. I also believe they are flexible enough to integrate other threading technologies into them at a later point if needed (OpenMP, etc.):

www.threadingbuildingblocks.org
www.cilkplus.org
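For a flavor of how little threading machinery you have to touch with these libraries, here is a minimal TBB sketch (assuming the TBB headers are installed); the loop body is an arbitrary example, not anything from the post above:

    #include <tbb/blocked_range.h>
    #include <tbb/parallel_for.h>
    #include <cstddef>
    #include <vector>

    int main() {
        std::vector<double> data(1000000, 1.0);

        // TBB splits the index range into tasks and schedules them over the
        // available cores; the caller never touches threads or locks directly.
        tbb::parallel_for(
            tbb::blocked_range<std::size_t>(0, data.size()),
            [&](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    data[i] *= 2.0;
            });
        return 0;
    }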
I have personally taken up Erlang (and I like it so far). The message-based approach seems to fit most of the problem, and I think that is going to be one of the key items for multi-core programming. I never knew about the overhead of MPI; thanks for pointing it out.
I've used MPI extensively on large clusters with multi-core nodes. I'm not sure if it's the right thing for a single multi-core box, but if you anticipate that your code may one day scale larger than a single chip, you might consider implementing it in MPI. Right now, nothing scales larger than MPI. I'm not sure where the posters who mention unacceptable overheads are coming from, but I've tried to give an overview of the relevant tradeoffs below. Read on for more.
MPI is the de-facto standard for large-scale scientific computation and it's in wide use on multicore machines already. It is very fast. Take a look at the most recent Top 500 list. The top machines on that list have, in some cases, hundreds of thousands of processors, with multi-socket dual- and quad-core nodes. Many of these machines have very fast custom networks (Torus, Mesh, Tree, etc) and optimized MPI implementations that are aware of the hardware.
If you want to use MPI with a single-chip multi-core machine, it will work fine. In fact, recent versions of Mac OS X come with OpenMPI pre-installed, and you can download and install OpenMPI pretty painlessly on an ordinary multi-core Linux machine. OpenMPI is in use at Los Alamos on most of their systems. Livermore uses MVAPICH on their Linux clusters. What you should keep in mind before diving in is that MPI was designed for solving large-scale scientific problems on distributed-memory systems. The multi-core boxes you are dealing with probably have shared memory.
OpenMPI and other implementations use shared memory for local message passing by default, so you don't have to worry about network overhead when you're passing messages to local processes. It's pretty transparent, and I'm not sure where other posters are getting their concerns about high overhead. The caveat is that MPI is not the easiest thing you could use to get parallelism on a single multi-core box. In MPI, all the message passing is explicit. It has been called the "assembly language" of parallel programming for this reason. Explicit communication between processes isn't easy if you're not an experienced HPC person, and there are other paradigms more suited for shared memory (UPC, OpenMP, and nice languages like Erlang to name a few) that you might try first.
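For a sense of what "explicit" means here, a minimal two-rank example might look like the following (file and program names are illustrative). Run on a single multi-core box, an implementation like OpenMPI will route the message through shared memory rather than the network.

    // Build and run (names are illustrative):
    //   mpicxx hello_msg.cpp -o hello_msg
    //   mpirun -np 2 ./hello_msg
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int payload = 42;
            // Every transfer is spelled out: who sends, to whom, how much, which tag.
            MPI_Send(&payload, 1, MPI_INT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int payload = 0;
            MPI_Recv(&payload, 1, MPI_INT, /*source=*/0, /*tag=*/0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank 1 received %d from rank 0\n", payload);
        }

        MPI_Finalize();
        return 0;
    }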
My advice is to go with MPI if you anticipate writing a parallel application that may need more than a single machine to solve. You'll be able to test and run fine with a regular multi-core box, and migrating to a cluster will be pretty painless once you get it working there. If you are writing an application that will only ever need a single machine, try something else. There are easier ways to exploit that kind of parallelism.
Finally, if you are feeling really adventurous, try MPI in conjunction with threads, OpenMP, or some other local shared-memory paradigm. You can use MPI for the distributed message passing and something else for on-node parallelism. This is where big machines are going; future machines with hundreds of thousands of processors or more are expected to have MPI implementations that scale to all nodes but not all cores, and HPC people will be forced to build hybrid applications. This isn't for the faint of heart, and there's a lot of work to be done before there's an accepted paradigm in this space.
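As a bare-bones sketch of that hybrid style (assuming an MPI library built with thread support; the program name and loop are illustrative), OpenMP fills the cores within each rank while MPI combines results across ranks:

    // Build and run (names are illustrative):
    //   mpicxx -fopenmp hybrid.cpp -o hybrid
    //   mpirun -np <number_of_nodes> ./hybrid
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        int provided = 0;
        // Ask for threaded MPI; FUNNELED means only the main thread makes MPI calls.
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        // On-node parallelism: OpenMP threads share this rank's memory.
        double local_sum = 0.0;
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = 0; i < 1000000; ++i)
            local_sum += 1.0;

        // Off-node parallelism: explicit message passing combines the ranks' results.
        double global_sum = 0.0;
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            std::printf("global sum = %.0f\n", global_sum);

        MPI_Finalize();
        return 0;
    }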