ORTE_ERROR_LOG: The system limit on number of pipes a process can open was reached in file odls_default_module.c at line 621
I am working on a project in which I am using CUDA-Aware MPI. I basically have two datasets of different sizes, both in CSV format (let's just say I have a small and a large dataset). The small dataset has 20 rows and the large dataset has 376 rows. I create the number of processes based on the number of rows in that particular dataset.
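A minimal sketch of what that setup could look like, assuming one MPI rank per CSV row and a launch of the form mpirun -np <row count> ./app data.csv (the file name, helper, and size check below are illustrative assumptions, not the actual project code):

#include <mpi.h>
#include <stdio.h>

/* Illustrative helper: count records in a CSV file, assuming one record per line. */
static long count_rows(const char *path)
{
    FILE *f = fopen(path, "r");
    if (!f) { perror("fopen"); MPI_Abort(MPI_COMM_WORLD, 1); }
    long rows = 0;
    for (int c = fgetc(f); c != EOF; c = fgetc(f))
        if (c == '\n') rows++;
    fclose(f);
    return rows;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Assumption: the job was launched with -np equal to the row count,
     * so rank i works on row i of the CSV. */
    if (rank == 0) {
        long rows = count_rows(argc > 1 ? argv[1] : "data.csv");
        if (rows != size)
            fprintf(stderr, "expected %ld ranks (one per row), got %d\n", rows, size);
    }

    /* ... per-row work for this rank would go here ... */

    MPI_Finalize();
    return 0;
}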
PC Specifications
- CPU : Intel® Xeon(R) Silver 4114 CPU @ 2.20GHz × 40
- OS : Ubuntu 21.10
When I try to run my program with the smaller dataset it works perfectly fine (I create 20 different processes and they all work).
But when I try to run it with the larger dataset it asks me to use --oversubscribe.
So after using --oversubscribe, it gives me this error:
[lenovo-ThinkStation-P720:136633] [[13842,0],0] ORTE_ERROR_LOG: The system limit on number of pipes a process can open was reached in file odls_default_module.c at line 621
The above error occurs 44 times, and then this appears:
mpirun: Forwarding signal 18 to job
and then it stops executing but doesn't exit.
So I tried to replicate it with a simple MPI program (it just prints the rank of the process) and did trial and error to find the point where it stops working. I can oversubscribe up to -np 272; beyond that it fails to run and gives me the same error.
I am not sure, but I suspect it is too much load on each processor.
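For reference, the kind of trivial rank-printing test described above would look something like this (a sketch; the actual test code is not shown in the question):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* No real work: any failure here is about launching this many
     * processes, not about the application itself. */
    printf("rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}

Launched with, for example, mpirun --oversubscribe -np 272 ./rank_test (binary name assumed), it runs; above that it fails with the same ORTE_ERROR_LOG message.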
I just want to know why it fails and what I should be doing instead!
Any help is appreciated.
Thanks !!
1 answer:
Ran into this on a dual socket machine with 384 threads (with -use-hwthread-cpus set).

ulimit -n 2048

allowed the test to run. The default here is 1024.
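A likely reading of why an open-files limit fixes a "pipes" error (my inference, not part of the answer above): mpirun's launcher opens pipes to forward each child rank's stdin/stdout/stderr, every pipe end counts against the per-process file-descriptor limit that ulimit -n reports, and Ubuntu's default soft limit of 1024 runs out after a few hundred ranks, which matches the failure somewhere past -np 272. A small illustrative snippet (not from the answer) for checking that limit programmatically:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* RLIMIT_NOFILE is the limit `ulimit -n` shows: the maximum number
     * of file descriptors (including pipe ends) one process may hold. */
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("open-file soft limit: %llu, hard limit: %llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);
    return 0;
}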