What is the best way to avoid overloading a parallel file system when running embarrassingly parallel jobs?
We have a problem which is embarrassingly parallel - we run a large number of instances of a single program with a different data set for each; we do this simply by submitting the application many times to the batch queue with different parameters each time.
However, with a large number of jobs, not all of them complete. It does not appear to be a problem in the queue - all of the jobs are started.
The issue appears to be that with a large number of instances of the application running, lots of jobs finish at roughly the same time and thus all try to write out their data to the parallel file-system at pretty much the same time.
The result seems to be that either the program is unable to write to the file-system and crashes in some manner, or it just sits there waiting to write and the batch queue system kills the job after it has been waiting too long. (From what I have gathered on the problem, most of the jobs that fail to complete, if not all, do not leave core files.)
What is the best way to schedule disk-writes to avoid this problem? I mention that our program is embarrassingly parallel to highlight the fact that each process is not aware of the others - they cannot talk to each other to schedule their writes in some manner.
Although I have the source-code for the program, we'd like to solve the problem without having to modify this if possible as we don't maintain or develop it (plus most of the comments are in Italian).
I have had some thoughts on the matter:
- Each job writes to the local (scratch) disk of the node at first. We can then run another job which checks every now and then which jobs have completed and moves the files from the local disks to the parallel file-system.
- Use an MPI wrapper around the program in a master/slave system, where the master manages a queue of jobs and farms these off to each slave; the slave wrapper runs the application, catches the exception (could I do this reliably for a file-system timeout in C++, or possibly Java?), and sends a message back to the master to re-run the job.
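A minimal sketch of the first idea, written as a wrapper so the program itself stays untouched. It assumes the program writes its output files into its working directory and that the temporary directory (for example `TMPDIR` as set by the batch system) lives on node-local scratch; both are assumptions, and the names `run_and_stage_out` and `copy_with_retry` are hypothetical. The copy step retries with random jitter so that jobs finishing together do not all hit the shared file-system at the same instant.

```python
import random
import shutil
import subprocess
import tempfile
import time
from pathlib import Path


def copy_with_retry(src, dest_dir, attempts=5, base_delay=1.0):
    """Copy src into dest_dir, retrying with randomised backoff on I/O errors."""
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / src.name
    partial = dest_dir / (src.name + ".part")
    for attempt in range(attempts):
        try:
            shutil.copyfile(src, partial)
            # Rename at the end so nobody sees a half-written result file.
            partial.replace(target)
            return target
        except OSError:
            if attempt == attempts - 1:
                raise
            # Jitter spreads out jobs that finished at the same moment.
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise RuntimeError("attempts must be at least 1")


def run_and_stage_out(command, shared_dir):
    """Run command inside a scratch directory, then copy its files to shared_dir."""
    with tempfile.TemporaryDirectory() as scratch:
        # Assumption: the program writes its results into the current directory.
        subprocess.run(command, cwd=scratch, check=True)
        outputs = sorted(p for p in Path(scratch).iterdir() if p.is_file())
        return [copy_with_retry(p, shared_dir) for p in outputs]
```

Each batch job would then call something like `run_and_stage_out(["./program", "dataset-17"], "/shared/results/dataset-17")`, where the command and paths are placeholders. A non-zero exit status raises `subprocess.CalledProcessError`, so a failed run is visible in the job's exit code rather than silently producing nothing.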
In the meantime I need to pester my supervisors for more information on the error itself - I've never run into it personally, but I haven't had to use the program for a very large number of datasets (yet).
In case it's useful: we run Solaris on our HPC system with the SGE (Sun GridEngine) batch queue system. The file-system is NFS4, and the storage servers also run Solaris. The HPC nodes and storage servers communicate over fibre channel links.
Comments (3)
Most parallel file systems, particularly those at supercomputing centres, are targeted at HPC applications, rather than serial-farm type stuff. As a result, they're painstakingly optimized for bandwidth, not for IOPS (I/O operations per second) - that is, they are aimed at big (1000+ process) jobs writing a handful of mammoth files, rather than zillions of little jobs outputting octillions of tiny little files. It is all too easy for users to take something that runs fine(ish) on their desktop, naively scale it up to hundreds of simultaneous jobs, and starve the system of IOPS, hanging their own jobs and typically others on the same system.
The main thing you can do here is aggregate, aggregate, aggregate. It would be best if you could tell us where you're running so we can get more information on the system. But some tried-and-true strategies:
The above suggestions will benefit the I/O performance of your code everywhere, not just on parallel file systems. I/O is slow everywhere, and the more you can do in memory and the fewer actual I/O operations you execute, the faster it will go. Some systems may be more sensitive than others, so you may not notice it so much on your laptop, but it will help.
Similarly, having fewer big files rather than many small files will speed up everything from directory listings to backups on your filesystem; it is good all around.
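As one illustration of the "fewer big files" point, here is a minimal sketch that packs a job's many small output files into a single uncompressed tar archive, so that only one large file has to land on the shared file system. The function name `bundle_outputs` and the directory layout are hypothetical, and it assumes the small files were first written somewhere cheap, such as node-local scratch.

```python
import tarfile
from pathlib import Path


def bundle_outputs(output_dir, archive_path):
    """Pack every file under output_dir into one tar archive; return the file count."""
    output_dir = Path(output_dir)
    archive_path = Path(archive_path)
    # Collect the file list first; archive_path should live outside output_dir.
    members = sorted(p for p in output_dir.rglob("*") if p.is_file())
    with tarfile.open(archive_path, "w") as archive:
        for path in members:
            # Store paths relative to output_dir so the archive unpacks cleanly.
            archive.add(path, arcname=path.relative_to(output_dir).as_posix())
    return len(members)
```

A job would call `bundle_outputs(scratch_dir, "/shared/results/job-42.tar")` once at the end (paths are placeholders), turning thousands of small creates on the shared file system into one sequential write.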
It is hard to decide if you don't know what exactly causes the crash. If you think it is an error related to the filesystem performance, you can try a distributed filesystem: http://hadoop.apache.org/common/docs/r0.20.0/hdfs_user_guide.html
If you want to implement a master/slave system, maybe Hadoop can be the answer.
But first of all I would try to find out what causes the crash...
OSes don't always behave nicely when they run out of resources; sometimes they simply abort the process that asks for the first unit of resource the OS can't provide. Many OSes have file-handle limits (Windows, I think, has a limit of several thousand handles, which you can bump up against in circumstances like yours), and failure to find a free handle usually means the OS does bad things to the requesting process.
One simple solution, which does require a program change, is to agree that no more than N of your many jobs can be writing at once. You'll need a shared semaphore that all jobs can see; most OSes will provide you with facilities for one, often as a named resource (!). Initialize the semaphore to N before you launch any job.
Have each writing job acquire a resource unit from the semaphore when the job is about to write, and release that resource unit when it is done. The amount of code to accomplish this should be a handful of lines inserted once into your highly parallel application. Then you tune N until you no longer have the problem. N==1 will surely solve it, and you can presumably do lots better than that.
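A rough sketch of that acquire/release pattern follows. Note the swap: an OS-level named semaphore is normally visible only to processes on one machine, so this sketch instead uses N "slot" directories in a directory every job can see, relying on `mkdir` succeeding for only one caller at a time. That atomicity on your shared file system is an assumption worth verifying, and the name `write_slot` and the slot layout are hypothetical.

```python
import os
import random
import time
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def write_slot(lock_dir, max_writers, poll_seconds=1.0):
    """Hold one of max_writers slots while the body runs (file-based counting semaphore)."""
    if max_writers < 1:
        raise ValueError("max_writers must be at least 1")
    lock_dir = Path(lock_dir)
    lock_dir.mkdir(parents=True, exist_ok=True)
    held = None
    while held is None:
        # Try the slots in random order so jobs do not all contend for slot 0.
        for index in random.sample(range(max_writers), max_writers):
            candidate = lock_dir / f"slot-{index}"
            try:
                os.mkdir(candidate)  # fails if another job already holds this slot
                held = candidate
                break
            except FileExistsError:
                continue
        if held is None:
            # All N slots are busy: wait a randomised interval, then try again.
            time.sleep(poll_seconds * random.uniform(0.5, 1.5))
    try:
        yield held
    finally:
        os.rmdir(held)  # release the slot
```

The write phase would then be wrapped as `with write_slot("/shared/locks", 8): ...`, with the path and N as placeholders to tune. One caveat of this design: a job that is killed while holding a slot leaves its directory behind, so stale slots need to be cleaned up by hand or by an age check.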