Shared memory, MPI and queuing systems
My unix/windows C++ app is already parallelized using MPI: the job is split across N CPUs, each chunk is executed in parallel, it is quite efficient, the speed scaling is very good, and the job is done right.
But some of the data is repeated in each process, and for technical reasons this data cannot be easily split over MPI (...).
For example:
- 5 Gb of static data, exact same thing loaded for each process
- 4 Gb of data that can be distributed in MPI; the more CPUs are used, the smaller this per-CPU RAM is.
On a 4-CPU job, this would mean at least a 20 Gb RAM load, with most of the memory 'wasted'; this is awful.
I'm thinking of using shared memory to reduce the overall load: the "static" chunk would be loaded only once per computer.
So, the main question is:
Is there any standard MPI way to share memory on a node? Some kind of readily available + free library?
- If not, I would use boost.interprocess and use MPI calls to distribute local shared-memory identifiers (a rough sketch follows after this list).
- The shared memory would be read by a "local master" on each node, and shared read-only. No need for any kind of semaphore/synchronization, because it won't change.
Any performance hit or particular issues to be wary of?
- (There won't be any "strings" or overly weird data structures; everything can be brought down to arrays and structure pointers.)
The job will be executed in a PBS (or SGE) queuing system; in the case of an unclean process exit, I wonder whether those will clean up the node-specific shared memory.
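For illustration, here is a hedged sketch of the boost.interprocess idea from the list above. The segment name, size, and data layout are made-up placeholders: one "local master" per node creates and fills a read-only shared-memory block, the other ranks on that node open it, and MPI is only used to pick the local master and synchronize.

```cpp
// Sketch only: one "local master" per node creates a shared-memory block with
// boost.interprocess; every local rank then maps it read-only.
#include <boost/interprocess/shared_memory_object.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <mpi.h>
#include <cstring>

namespace bip = boost::interprocess;

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // Communicator of the ranks on this node (MPI-3 style split; with MPI-2
    // you would have to group ranks by hostname yourself).
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &nodecomm);
    int noderank;
    MPI_Comm_rank(nodecomm, &noderank);

    const char* name = "static_block";      // placeholder identifier
    const std::size_t bytes = 1024 * 1024;  // placeholder size

    if (noderank == 0) {
        // Local master: create, size and fill the segment once per node.
        bip::shared_memory_object shm(bip::open_or_create, name, bip::read_write);
        shm.truncate(bytes);
        bip::mapped_region region(shm, bip::read_write);
        std::memset(region.get_address(), 0, region.get_size()); // load static data here
    }
    MPI_Barrier(nodecomm);                  // data is in place before anyone reads

    // Every rank on the node maps the block read-only.
    bip::shared_memory_object shm(bip::open_only, name, bip::read_only);
    bip::mapped_region region(shm, bip::read_only);
    // ... read from region.get_address() ...

    MPI_Barrier(nodecomm);
    if (noderank == 0) bip::shared_memory_object::remove(name);  // explicit cleanup
    MPI_Finalize();
}
```

Note that the explicit remove() at the end matters: a named boost.interprocess segment outlives the processes that created it, which is exactly the cleanup concern raised above.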
One increasingly common approach in High Performance Computing (HPC) is hybrid MPI/OpenMP programs. I.e. you have N MPI processes, and each MPI process has M threads. This approach maps well to clusters consisting of shared memory multiprocessor nodes.
Changing to such a hierarchical parallelization scheme obviously requires some more or less invasive changes; OTOH, if done properly, it can increase the performance and scalability of the code in addition to reducing memory consumption for replicated data.
Depending on the MPI implementation, you may or may not be able to make MPI calls from all threads. This is specified by the required and provided arguments to the MPI_Init_thread() function that you must call instead of MPI_Init(). The possible values are MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, MPI_THREAD_SERIALIZED, and MPI_THREAD_MULTIPLE. In my experience, modern MPI implementations like Open MPI support the most flexible MPI_THREAD_MULTIPLE. If you use older MPI libraries, or some specialized architecture, you might be worse off.
Of course, you don't need to do your threading with OpenMP, that's just the most popular option in HPC. You could use e.g. the Boost threads library, the Intel TBB library, or straight pthreads or windows threads for that matter.
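A minimal sketch of the hybrid setup, assuming an MPI library and OpenMP are available; the loop body is just placeholder work:

```cpp
// Hybrid MPI + OpenMP sketch: one MPI process per node, M OpenMP threads inside it.
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    int provided = 0;
    // Ask for MPI_THREAD_FUNNELED: only the main thread makes MPI calls.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        // The library cannot give us the requested thread support level.
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0;
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < 1000000; ++i) {
        local_sum += i * 1e-6;              // placeholder work
    }

    double global_sum = 0.0;                // MPI call from the main thread only
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("sum = %f\n", global_sum);

    MPI_Finalize();
}
```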
I haven't worked with MPI, but if it's like other IPC libraries I've seen that hide whether other threads/processes/whatever are on the same or different machines, then it won't be able to guarantee shared memory. Yes, it could handle shared memory between two nodes on the same machine, if that machine provided shared memory itself. But trying to share memory between nodes on different machines would be very difficult at best, due to the complex coherency issues raised. I'd expect it to simply be unimplemented.
In all practicality, if you need to share memory between nodes, your best bet is to do that outside MPI. I don't think you need to use boost.interprocess-style shared memory, since you aren't describing a situation where the different nodes are making fine-grained changes to the shared memory; it's either read-only or partitioned.
John's and deus's answers cover how to map in a file, which is definitely what you want to do for the 5 Gb (gigabit?) static data. The per-CPU data sounds like the same thing, and you just need to send a message to each node telling it what part of the file it should grab. The OS should take care of mapping virtual memory onto the physical pages backing the file.
As for cleanup... I would assume MPI doesn't do any cleanup of shared memory, but mmap'ed files should be cleaned up, since the files are closed (which should release their memory mappings) when a process is cleaned up. I have no idea what caveats CreateFileMapping etc. have.
Actual "shared memory" (i.e. boost.interprocess) is not cleaned up when a process dies. If possible, I'd recommend killing a process and seeing what is left behind.
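A rough sketch of the "tell each rank which part of the file to grab" idea, assuming the distributable data lives in a hypothetical file reachable from every node; rank 0 scatters page-aligned offsets and each rank maps only its own slice (names and sizes are placeholders):

```cpp
// Sketch only: each rank mmaps just its slice of a shared data file.
#include <mpi.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Rank 0 decides which page-aligned slice each rank owns (sizes made up).
    const long page = sysconf(_SC_PAGESIZE);
    const long long slice_bytes = ((1LL << 30) / page) * page;   // ~1 Gb, page-aligned
    std::vector<long long> offsets(nprocs);
    if (rank == 0)
        for (int r = 0; r < nprocs; ++r) offsets[r] = r * slice_bytes;

    long long my_offset = 0;
    MPI_Scatter(offsets.data(), 1, MPI_LONG_LONG,
                &my_offset, 1, MPI_LONG_LONG, 0, MPI_COMM_WORLD);

    // Each rank maps only its own slice; the OS pages it in on demand.
    int fd = open("distributed_data.bin", O_RDONLY);
    void* slice = mmap(nullptr, (size_t)slice_bytes, PROT_READ, MAP_SHARED,
                       fd, (off_t)my_offset);
    // ... overlay arrays/structs on `slice` and compute ...

    munmap(slice, (size_t)slice_bytes);
    close(fd);
    MPI_Finalize();
}
```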
With MPI-2 you have RMA (remote memory access) via functions such as MPI_Put and MPI_Get. Using these features, if your MPI installation supports them, would certainly help you reduce the total memory consumption of your program. The cost is added complexity in coding but that's part of the fun of parallel programming. Then again, it does keep you in the domain of MPI.
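A hedged sketch of the one-sided approach, assuming your MPI-2 installation supports RMA; here rank 0 exposes an array in a window and the other ranks read it with MPI_Get (the size is a placeholder):

```cpp
// MPI-2 one-sided sketch: rank 0 exposes data in a window, others MPI_Get from it.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int N = 1024;                       // placeholder size
    std::vector<double> data;
    if (rank == 0) data.assign(N, 3.14);      // the "static" data lives on rank 0

    MPI_Win win;
    MPI_Win_create(rank == 0 ? data.data() : nullptr,
                   rank == 0 ? N * sizeof(double) : 0,
                   sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    std::vector<double> buf(N);
    MPI_Win_fence(0, win);
    if (rank != 0)
        MPI_Get(buf.data(), N, MPI_DOUBLE, /*target rank*/ 0,
                /*target displacement*/ 0, N, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                    // the Get completes at the closing fence

    MPI_Win_free(&win);
    MPI_Finalize();
}
```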
MPI-3 offers shared memory windows (see e.g. MPI_Win_allocate_shared()), which allow usage of on-node shared memory without any additional dependencies.
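A rough sketch of how this can look, assuming an MPI-3 library: split the world communicator per node, let one rank per node allocate the block, and let the others query a direct pointer to it (the size is a placeholder):

```cpp
// MPI-3 shared-memory window sketch: the block exists once per node.
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // Communicator containing only the ranks on this node.
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);
    int noderank;
    MPI_Comm_rank(nodecomm, &noderank);

    const MPI_Aint bytes = 1024 * 1024;       // placeholder size for the static block
    double* base = nullptr;
    MPI_Win win;
    // Only node rank 0 allocates; the others request 0 bytes.
    MPI_Win_allocate_shared(noderank == 0 ? bytes : 0, sizeof(double),
                            MPI_INFO_NULL, nodecomm, &base, &win);

    // Non-allocating ranks ask for the pointer to rank 0's segment.
    if (noderank != 0) {
        MPI_Aint size;
        int disp_unit;
        MPI_Win_shared_query(win, 0, &size, &disp_unit, &base);
    }
    if (noderank == 0) base[0] = 42.0;        // load the static data once per node
    MPI_Barrier(nodecomm);
    // ... every rank on the node can now read base[...] directly ...

    MPI_Win_free(&win);
    MPI_Finalize();
}
```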
I don't know much about unix, and I don't know what MPI is. But in Windows, what you are describing is an exact match for a file mapping object.
If this data is embedded in your .EXE or a .DLL that it loads, then it will automatically be shared between all processes. Teardown of your process, even as a result of a crash, will not cause any leaks or unreleased locks of your data. However, a 9 Gb .dll sounds a bit iffy, so this probably doesn't work for you.
However, you could put your data into a file, then CreateFileMapping and MapViewOfFile on it. The mapping can be read-only, and you can map all or part of the file into memory. All processes will share pages that are mapped to the same underlying CreateFileMapping object. It's good practice to unmap views and close handles, but if you don't, the OS will do it for you on teardown. Note that unless you are running x64, you won't be able to map a 5 Gb file into a single view (or even a 2 Gb file; 1 Gb might work). But given that you are talking about having this already working, I'm guessing that you are already x64 only.
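A hedged Win32 sketch of that approach (the file name is a placeholder); every process on the machine that maps the same file shares the physical pages:

```cpp
// Win32 read-only file mapping sketch: processes mapping the same file share pages.
#include <windows.h>
#include <cstdio>

int main() {
    HANDLE file = CreateFileA("static_data.bin", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    // PAGE_READONLY mapping over the whole file (size 0/0 means "entire file").
    HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (!mapping) return 1;

    // Map the whole file into this process's address space, read-only.
    const void* view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    if (!view) return 1;

    const double* data = static_cast<const double*>(view);
    std::printf("first value: %f\n", data[0]);   // overlay your structs/arrays here

    UnmapViewOfFile(view);
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}
```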
If you store your static data in a file, you can use mmap on unix to get random access to the data. Data will be paged in as and when you need access to a particular bit of the data. All that you will need to do is overlay any binary structures over the file data. This is the unix equivalent of CreateFileMapping and MapViewOfFile mentioned above.
Incidentally, glibc's malloc uses mmap to satisfy sufficiently large requests (above its mmap threshold, 128 kB by default).
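A minimal sketch of this, assuming the static block sits in a hypothetical binary file; each process maps it read-only and the kernel shares the pages between them via the page cache:

```cpp
// mmap sketch: map the static data file read-only; pages are shared between processes.
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int fd = open("static_data.bin", O_RDONLY);
    if (fd < 0) return 1;

    struct stat st;
    fstat(fd, &st);                          // map the whole file

    void* addr = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) return 1;

    // Overlay your arrays/structs on the mapping; pages fault in on first access.
    const double* table = static_cast<const double*>(addr);
    std::printf("first value: %f\n", table[0]);

    munmap(addr, st.st_size);
    close(fd);
    return 0;
}
```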
I had some projects with MPI in SHUT.
As far as I know, there are many ways to distribute a problem using MPI; maybe you can find another solution that does not require shared memory.
My project was solving 7,000,000 equations in 7,000,000 variables.
If you can explain your problem, I will try to help you.
I ran into this problem in the small when I used MPI a few years ago.
I am not certain that SGE understands memory-mapped files. If you are distributing across a Beowulf cluster, I suspect you're going to have coherency issues. Could you discuss a little about your multiprocessor architecture?
My draft approach would be to set up an architecture where each part of the data is owned by a defined CPU. There would be two threads: one thread being an MPI two-way talker and one thread for computing the result. Note that MPI and threads don't always play well together.
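As a rough illustration of that layout (a sketch only, assuming MPI_THREAD_FUNNELED so that only the main thread talks MPI while a worker thread computes):

```cpp
// Sketch of the two-thread layout: main thread is the MPI talker,
// a worker thread computes. Requires at least MPI_THREAD_FUNNELED.
#include <mpi.h>
#include <atomic>
#include <thread>

int main(int argc, char** argv) {
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) MPI_Abort(MPI_COMM_WORLD, 1);

    std::atomic<bool> done{false};
    double local_result = 0.0;

    // Compute thread: works on the data this rank owns, makes no MPI calls.
    std::thread worker([&] {
        for (int i = 0; i < 1000000; ++i) local_result += 1e-6;  // placeholder work
        done = true;
    });

    // Main thread: the "MPI talker" (here it just waits, then reduces).
    while (!done) { /* could service MPI communication here */ }
    worker.join();

    double global_result = 0.0;
    MPI_Reduce(&local_result, &global_result, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
}
```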