ORTE_ERROR_LOG: The system limit on number of pipes a process can open was reached in file odls_default_module.c at line 621
I am working on a project in which I am using CUDA-Aware MPI. I basically have two datasets of different sizes, both in CSV format (let's just say I have a small and a large dataset). The small dataset has 20 rows and the large dataset has 376 rows. I create the number of processes based on the number of rows in that particular dataset.
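A minimal sketch of what that setup could look like, assuming one MPI rank per CSV row and a launch of the form mpirun -np <row count> ./app data.csv (the file name, helper, and size check below are illustrative assumptions, not the actual project code):

#include <mpi.h>
#include <stdio.h>

/* Illustrative helper: count records in a CSV file, assuming one record per line. */
static long count_rows(const char *path)
{
    FILE *f = fopen(path, "r");
    if (!f) { perror("fopen"); MPI_Abort(MPI_COMM_WORLD, 1); }
    long rows = 0;
    for (int c = fgetc(f); c != EOF; c = fgetc(f))
        if (c == '\n') rows++;
    fclose(f);
    return rows;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Assumption: the job was launched with -np equal to the row count,
     * so rank i works on row i of the CSV. */
    if (rank == 0) {
        long rows = count_rows(argc > 1 ? argv[1] : "data.csv");
        if (rows != size)
            fprintf(stderr, "expected %ld ranks (one per row), got %d\n", rows, size);
    }

    /* ... per-row work for this rank would go here ... */

    MPI_Finalize();
    return 0;
}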
PC Specifications
- CPU : Intel® Xeon(R) Silver 4114 CPU @ 2.20GHz × 40
- OS : Ubuntu 21.10
When I try to run my program with the smaller dataset it works perfectly fine (I create 20 different processes and they all work).
But when I try to run it with the larger dataset it asks me to use --oversubscribe.
So after using --oversubscribe, it gives me this error:
[lenovo-ThinkStation-P720:136633] [[13842,0],0] ORTE_ERROR_LOG: The system limit on number of pipes a process can open was reached in file odls_default_module.c at line 621
The above error occurs 44 times, and then this appears:
mpirun: Forwarding signal 18 to job
and then it stops executing but doesn't exit.
So I tried to replicate it with a simple MPI program (it just prints the rank of the process) and did trial and error to find the point where it stops working. I can oversubscribe up to -np 272; beyond that it fails to run and gives me the same error.
I am not sure, but I suspect it is too much load on each processor.
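For reference, the kind of trivial rank-printing test described above would look something like this (a sketch; the actual test code is not shown in the question):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* No real work: any failure here is about launching this many
     * processes, not about the application itself. */
    printf("rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}

Launched with, for example, mpirun --oversubscribe -np 272 ./rank_test (binary name assumed), it runs; above that it fails with the same ORTE_ERROR_LOG message.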
I just want to know why it fails and what I should be doing instead!
Any help is appreciated.
Thanks !!
1 answer:
Ran into this on a dual socket machine with 384 threads (with -use-hwthread-cpus set).

ulimit -n 2048

allowed the test to run. The default here is 1024.
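A likely reading of why an open-files limit fixes a "pipes" error (my inference, not part of the answer above): mpirun's launcher opens pipes to forward each child rank's stdin/stdout/stderr, every pipe end counts against the per-process file-descriptor limit that ulimit -n reports, and Ubuntu's default soft limit of 1024 runs out after a few hundred ranks, which matches the failure somewhere past -np 272. A small illustrative snippet (not from the answer) for checking that limit programmatically:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    /* RLIMIT_NOFILE is the limit `ulimit -n` shows: the maximum number
     * of file descriptors (including pipe ends) one process may hold. */
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("open-file soft limit: %llu, hard limit: %llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);
    return 0;
}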