How can Python see 12 CPUs on a cluster where LSF allocated me 4 cores?
I access a Linux cluster where resources are allocated using LSF, which I think is a common tool and comes from Scali (http://www.scali.com/workload-management/high-performance-computing). In an interactive queue, I asked for and got the maximum number of cores: 4. But if I check how many CPUs Python's multiprocessing module sees, the number is 12, the number of physical cores on the node I was allocated. It looks like the multiprocessing module has trouble respecting the bounds that LSF should/would impose. Is this a problem in LSF or in Python?
[lsandor@iliadaccess03 peers_prisons]$ bsub -Is -n 4 -q interact sh
Job <7408231> is submitted to queue <interact>.
<<Waiting for dispatch ...>>
<<Starting on heroint5>>
sh-3.2$ python3
Python 3.2 (r32:88445, Jun 13 2011, 09:20:03)
[GCC 4.3.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import multiprocessing
>>>
>>> multiprocessing.cpu_count()
12
3 Answers
Not a problem, although your program should respect the amount of resources allocated to it by the queuing system, which, as you have realized, may be considerably less than 100%. I don't believe LSF has OS-level hooks to enforce compliance, nor probably should it.
In the past I've seen this handled with a wrapper script: one that sets up the program and the job together with matching settings, then launches it.
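A hedged sketch of what such a wrapper might look like; the script name, core count, and `--workers` flag are all illustrative assumptions, not from the answer:

```shell
#!/bin/sh
# Illustrative wrapper: decide the core count once, then use the same
# number both for the LSF request and for the program's own worker
# setting, so the job and the program can never disagree.
NCORES=4
bsub -Is -n "$NCORES" -q interact python3 my_script.py --workers "$NCORES"
```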
A bit late to the party, but expanding on the answer of @Paddy3118, the span specification is not needed. Instead, the environment variable
LSB_DJOB_NUMPROC
holds the number of allocated cores. At least it does with the LSF version available to me (9.1.2).
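Putting that variable to use in Python might look like the sketch below; the fallback to `cpu_count()` for runs outside an LSF job is my assumption, not something stated in the answer:

```python
import multiprocessing
import os

def allocated_cores() -> int:
    """Return the LSF-allocated core count from LSB_DJOB_NUMPROC,
    falling back to the machine total when the variable is not set
    (e.g. when running outside an LSF job)."""
    return int(os.environ.get("LSB_DJOB_NUMPROC", multiprocessing.cpu_count()))

# Size the worker pool from the allocation rather than cpu_count():
# pool = multiprocessing.Pool(processes=allocated_cores())
```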
If you submit to LSF using the -n option to state how many processors you want, and then request that the four processors be made available on the same host by using
span
like in the command below, then my_job is started with the following environment variables set, which your Python script can interrogate to set the number of sub-processes to start equal to the number assigned by LSF:
(Or should the number of sub-processes be the number allocated by LSF minus 1, to account for the Python script that launches them? :-)
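The command the answer refers to did not survive in the post; a plausible form, using the standard LSF resource-requirement syntax for span and a placeholder job name, would be:

```shell
# Request 4 slots, all on a single host; inside the dispatched job,
# LSF exports LSB_DJOB_NUMPROC (and LSB_HOSTS / LSB_MCPU_HOSTS)
# describing the allocation.
bsub -n 4 -R "span[hosts=1]" my_job
```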