Running a job on multiple nodes of a GridEngine cluster
I have access to a 128-core cluster on which I would like to run a parallelised job. The cluster uses Sun GridEngine, and my program is written to run using Parallel Python, numpy, and scipy on Python 2.5.8. Running the job on a single node (4 cores) yields roughly a 3.5x improvement over a single core. I would now like to take this to the next level and split the job across ~4 nodes. My qsub script looks something like this:
#!/bin/bash
# The name of the job, can be whatever makes sense to you
#$ -N jobname
# The job should be placed into the queue 'all.q'.
#$ -q all.q
# Redirect output stream to this file.
#$ -o jobname_output.dat
# Redirect error stream to this file.
#$ -e jobname_error.dat
# The batch system should use the current directory as the working directory.
# Both output files will be placed in the current directory, and the batch
# system expects to find the executable there.
#$ -cwd
# request Bourne shell as shell for job.
#$ -S /bin/sh
# print date and time
date
# spython is the server's version of Python 2.5; using plain 'python' instead
# runs the program under Python 2.3.
spython programname.py
# print date and time again
date
Does anyone have any idea of how to do this?
1 Answer
Yes, you need to include the Grid Engine option
-np 16
either in your script or on the command line when you submit it. Or, for a more permanent arrangement, use an
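As a sketch, a processor-count directive sits alongside the other #$ lines in the submission script. Note that many Grid Engine installations instead require requesting a parallel environment with -pe (the environment name "mpi" below is a placeholder; list the real ones on your cluster with qconf -spl), so check which form your installation accepts:

```shell
#!/bin/bash
#$ -N jobname
#$ -q all.q
#$ -o jobname_output.dat
#$ -e jobname_error.dat
#$ -cwd
#$ -S /bin/sh
# Request 16 slots. On installations without a bare processor-count option,
# the equivalent is a parallel environment request such as:
#   #$ -pe mpi 16
#$ -np 16
date
spython programname.py
date
```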
.sge_request
file. On all the GE installations I've ever used, this will give you 16 processors (or processor cores, these days) on as few nodes as necessary: if your nodes have 4 cores you'll get 4 nodes, if they have 8 cores you'll get 2 nodes, and so on. Placing the job on, say, 2 cores on each of 8 nodes (which you might want to do if each process needs a lot of memory) is a little more complicated, and you should consult your support team.
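For the .sge_request route, a minimal sketch of what such a file might contain (standard Grid Engine behaviour: a .sge_request file in your home or working directory holds default qsub options, written as they would appear on the command line; the specific options below are illustrative):

```
# ~/.sge_request -- defaults applied to every qsub from this account
-q all.q
-cwd
-S /bin/sh
```

Options given in the script or on the qsub command line override these defaults, so the file is a convenient place for settings you rarely change.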