"Embarrassingly parallel" programming using python and PBS on a cluster

I have a function (neural network model) which produces figures. I wish to test several parameters, methods and different inputs (meaning hundreds of runs of the function) from python using PBS on a standard cluster with Torque.

Note: I tried parallelpython, ipython and such and was never completely satisfied, since I want something simpler. The cluster is in a given configuration that I cannot change, and such a solution integrating python + qsub will certainly benefit the community.

To simplify things, I have a simple function such as:

import myModule
import pylab

def model(input, a=1., N=100):
    myModule.do_lots_number_crunching(input, a, N)
    pylab.savefig('figure_' + input.name + '_' + str(a) + '_' + str(N) + '.png')

where input is an object representing the input, input.name is a string, and do_lots_number_crunching may last hours.

My question is: is there a correct way to transform a parameter scan such as

for a in pylab.linspace(0., 1., 100):
    model(input, a)

into "something" that would launch a PBS script for every call to the model function?

#PBS -l ncpus=1
#PBS -l mem=i1000mb
#PBS -l cput=24:00:00
#PBS -V
cd /data/work/
python experiment_model.py

I was thinking of a function that would include the PBS template and call it from the python script, but could not yet figure it out (decorator?).
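
Something along these lines is what I am after. Below is a rough, untested sketch of what I have in mind, assuming qsub accepts the job script on stdin and that experiment_model.py takes the value of a as a command-line argument (submit_job and the template string are just placeholders):

import subprocess
import pylab

# PBS job template; the resource lines are the ones from above
PBS_TEMPLATE = """#PBS -l ncpus=1
#PBS -l mem=i1000mb
#PBS -l cput=24:00:00
#PBS -V
cd /data/work/
python experiment_model.py %f
"""

def submit_job(a):
    """Fill the template for one value of 'a' and pipe it to qsub."""
    proc = subprocess.Popen(['qsub'], stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate((PBS_TEMPLATE % a).encode())
    return out.strip()  # qsub prints the job id it assigned

job_ids = [submit_job(a) for a in pylab.linspace(0., 1., 100)]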

Comments (4)

毅然前行 2024-09-17 22:26:34

pbs_python[1] could work for this. If experiment_model.py takes 'a' as an argument, you could do:

import pbs, os
import pylab

server_name = pbs.pbs_default()
c = pbs.pbs_connect(server_name)

# Resource requests, matching the #PBS directives from the question
attropl = pbs.new_attropl(4)
attropl[0].name  = pbs.ATTR_l
attropl[0].resource = 'ncpus'
attropl[0].value = '1'

attropl[1].name  = pbs.ATTR_l
attropl[1].resource = 'mem'
attropl[1].value = 'i1000mb'

attropl[2].name  = pbs.ATTR_l
attropl[2].resource = 'cput'
attropl[2].value = '24:00:00'

attropl[3].name = pbs.ATTR_V

script = '''
cd /data/work/
python experiment_model.py %f
'''

jobs = []

for a in pylab.linspace(0., 1., 100):
    # Write a one-off job script for this value of 'a', submit it, clean up
    script_name = 'experiment_model.job' + str(a)
    with open(script_name, 'w') as scriptf:
        scriptf.write(script % a)
    job_id = pbs.pbs_submit(c, attropl, script_name, 'NULL', 'NULL')
    jobs.append(job_id)
    os.remove(script_name)

print(jobs)

[1]: https://oss.trac.surfsara.nl/pbs_python/wiki/TorqueUsage pbs_python

输什么也不输骨气 2024-09-17 22:26:34

You can do this easily using jug (which I developed for a similar setup).

You'd write in a file (e.g., model.py):

from jug import TaskGenerator
import numpy as np
from matplotlib import pyplot

@TaskGenerator
def model(param1, param2):
    # complex_computation and pyplot.coolgraph stand in for your real code
    res = complex_computation(param1, param2)
    pyplot.coolgraph(res)


for param1 in np.linspace(0, 1., 100):
    for param2 in xrange(2000):
        model(param1, param2)

And that's it!

Now you can launch "jug jobs" on your queue: jug execute model.py, and this will parallelise automatically. What happens is that each job will, in a loop, do something like:

while not all_done():
    for t in tasks_that_i_can_run():
        if t.lock_for_me(): t.run()

(It's actually more complicated than that, but you get the point).

It uses the filesystem for locking (if you're on an NFS system) or a redis server if you prefer. It can also handle dependencies between tasks.

This is not exactly what you asked for, but I believe it's a cleaner architecture to separate this from the job queueing system.
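
To get the workers onto the Torque queue in the first place, one option is simply to qsub the same "jug execute model.py" command a number of times; here is a rough, untested sketch (the resource lines and /data/work/ are taken from the question, and num_workers is arbitrary):

import subprocess

# Each submitted job runs the same 'jug execute'; jug's locking makes the
# workers share the pending tasks instead of duplicating work.
JOB_SCRIPT = """#PBS -l ncpus=1
#PBS -l cput=24:00:00
#PBS -V
cd /data/work/
jug execute model.py
"""

num_workers = 10  # how many concurrent workers to put on the queue
for _ in range(num_workers):
    proc = subprocess.Popen(['qsub'], stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate(JOB_SCRIPT.encode())
    print(out.strip())  # the job id reported by qsub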

旧竹 2024-09-17 22:26:34

It looks like I'm a little late to the party, but I also had the same question of how to map embarrassingly parallel problems onto a cluster in python a few years ago and wrote my own solution. I recently uploaded it to github here: https://github.com/plediii/pbs_util

To write your program with pbs_util, I would first create a pbs_util.ini in the working directory containing

[PBSUTIL]
numnodes=1
numprocs=1
mem=i1000mb
walltime=24:00:00

Then a python script like this

import pbs_util.pbs_map as ppm

import pylab
import myModule

class ModelWorker(ppm.Worker):

    def __init__(self, input, N):
        self.input = input
        self.N = N

    def __call__(self, a):
        myModule.do_lots_number_crunching(self.input, a, self.N)
        pylab.savefig('figure_' + self.input.name + '_' + str(a) + '_' + str(self.N) + '.png')



# You need "main" protection like this since pbs_map will import this file on the compute nodes
if __name__ == "__main__":
    input, N = something, picklable
    # Use list to force the iterator
    list(ppm.pbs_map(ModelWorker, pylab.linspace(0., 1., 100),
                     startup_args=(input, N),
                     num_clients=100))

And that would do it.

作死小能手 2024-09-17 22:26:34

I just started working with clusters and EP applications. My goal (I'm with the Library) is to learn enough to help other researchers on campus access HPC with EP applications...especially researchers outside of STEM. I'm still very new, but thought it may help this question to point out the use of GNU Parallel in a PBS script to launch basic python scripts with varying arguments. In the .pbs file, there are two lines to point out:

module load gnu-parallel # this is required on my environment

parallel -j 4 --env PBS_O_WORKDIR --sshloginfile $PBS_NODEFILE \
--workdir $NODE_LOCAL_DIR --transfer --return 'output.{#}' --clean \
`pwd`/simple.py '{#}' '{}' ::: $INPUT_DIR/input.*

# `-j 4` is the number of processors to use per node, will be cluster-specific
# {#} will substitute the process number into the string
# `pwd`/simple.py `{#}` `{}`   this is the command that will be run multiple times
# ::: $INPUT_DIR/input.* all of the files in $INPUT_DIR/ that start with 'input.' 
#     will be substituted into the python call as the second(3rd) argument where the
#     `{}` resides.  These can be simple text files that you use in your 'simple.py'
#     script to pass the parameter sets, filenames, etc.

As a newbie to EP supercomputing, even though I don't yet understand all the other options of "parallel", this command allowed me to launch python scripts in parallel with different parameters. This works well if you can generate a slew of parameter files ahead of time that parallelize your problem: for example, running simulations across a parameter space, or processing many files with the same code.
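
For completeness, here is a guess at what such a simple.py might look like, given the way it is called above ({#} arrives as the first argument and one of the input.* files as the second); the body is only a placeholder:

#!/usr/bin/env python
import sys

# argv[1] is the process number substituted for {#}
# argv[2] is the input file substituted for {}
def main():
    proc_number = sys.argv[1]
    input_file = sys.argv[2]
    with open(input_file) as f:
        params = f.read().split()  # e.g. one parameter set per input file
    # ... do the real work with 'params' here ...
    with open('output.%s' % proc_number, 'w') as out:
        out.write('processed %s with parameters %s\n' % (input_file, params))

if __name__ == '__main__':
    main()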
