How can Python see 12 CPUs on a cluster where LSF allocated me 4 cores?
I access a Linux cluster where resources are allocated using LSF, which I think is a common tool and comes from Scali (http://www.scali.com/workload-management/high-performance-computing). In an interactive queue, I asked for and got the maximum number of cores: 4. But if I check how many CPUs Python's multiprocessing module sees, the number is 12, the number of physical cores on the node I was allocated. It looks like the multiprocessing module has trouble respecting the bounds that LSF should/would impose. Is this a problem in LSF or in Python?
[lsandor@iliadaccess03 peers_prisons]$ bsub -Is -n 4 -q interact sh
Job <7408231> is submitted to queue <interact>.
<<Waiting for dispatch ...>>
<<Starting on heroint5>>
sh-3.2$ python3
Python 3.2 (r32:88445, Jun 13 2011, 09:20:03)
[GCC 4.3.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import multiprocessing
>>>
>>> multiprocessing.cpu_count()
12
3 Answers
Not a problem, although your program should respect the amount of resources allocated to it by the queuing system, which, as you have realized, may be considerably less than 100%. I don't believe LSF has OS-level hooks to enforce compliance, nor probably should it.
In the past I've seen this handled with a wrapper script: one that sets up the program and the job together with matching settings, then launches it.
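A hedged sketch of what such a wrapper might look like; the script name, core count, and `--workers` flag are all illustrative assumptions, not from the answer:

```shell
#!/bin/sh
# Illustrative wrapper: decide the core count once, then use the same
# number both for the LSF request and for the program's own worker
# setting, so the job and the program can never disagree.
NCORES=4
bsub -Is -n "$NCORES" -q interact python3 my_script.py --workers "$NCORES"
```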
A bit late to the party, but expanding on the answer of @Paddy3118, the span specification is not needed. Instead, the environment variable
LSB_DJOB_NUMPROC
holds the number of allocated cores. At least it does with the LSF version available to me (9.1.2).
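Putting that variable to use in Python might look like the sketch below; the fallback to `cpu_count()` for runs outside an LSF job is my assumption, not something stated in the answer:

```python
import multiprocessing
import os

def allocated_cores() -> int:
    """Return the LSF-allocated core count from LSB_DJOB_NUMPROC,
    falling back to the machine total when the variable is not set
    (e.g. when running outside an LSF job)."""
    return int(os.environ.get("LSB_DJOB_NUMPROC", multiprocessing.cpu_count()))

# Size the worker pool from the allocation rather than cpu_count():
# pool = multiprocessing.Pool(processes=allocated_cores())
```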
If you submit to LSF using the -n option to state how many processors you want, and then request that the four processors be made available on the same host by using
span
like in the command below, then my_job is started with the following environment variables set, which your Python script can interrogate to set the number of sub-processes to start equal to the number assigned by LSF:
(Or should the number of sub-processes be the number allocated by LSF minus 1, to account for the Python script that launches them? :-)
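The command the answer refers to did not survive in the post; a plausible form, using the standard LSF resource-requirement syntax for span and a placeholder job name, would be:

```shell
# Request 4 slots, all on a single host; inside the dispatched job,
# LSF exports LSB_DJOB_NUMPROC (and LSB_HOSTS / LSB_MCPU_HOSTS)
# describing the allocation.
bsub -n 4 -R "span[hosts=1]" my_job
```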