Python (or Linux in general) file-operation flow control or file locking
I am using a cluster of computers to do some parallel computation. My home directory is shared across the cluster. On one machine, I have Ruby code that creates bash scripts containing the computation commands and writes them to, say, the ~/q/ directory. The scripts are named *.worker1.sh, *.worker2.sh, etc.
On the other 20 machines, I have 20 Python processes running (one on each machine) that constantly check the ~/q/ directory and look for the jobs that belong to that machine, using Python code like this:
jobs = glob.glob('q/*.worker1.sh')
[os.system('sh ' + job + ' &') for job in jobs]
For some additional control, the Ruby code will create an empty file like workeri.start (i = 1..20) in the q directory after it writes the bash scripts there, and the Python code checks for that 'start' file before it runs the code above. In the bash script, if the command finishes successfully, the script creates an empty file like 'workeri.success'; the Python code checks for this file after launching the jobs to make sure the computation finished successfully. If Python finds that the computation finished successfully, it removes the 'start' file from the q directory, so the Ruby code knows the job is done. After all 20 bash scripts have finished, the Ruby code creates new bash scripts, Python reads and executes them, and so on.
I know this is not an elegant way to coordinate the computation, but I haven't figured out a better way to communicate between different machines.
Now the question is: I expect the 20 jobs to run more or less in parallel, so the total time to finish all 20 jobs should not be much longer than the time to finish one. However, the jobs seem to run sequentially, and the total time is much longer than I expected.
I suspect that part of the reason is that multiple processes are reading and writing the same directory at once, and that the Linux system or Python locks the directory so that only one process can operate on it at a time. This would make the jobs execute one at a time.
I am not sure if this is actually the case. If I split the bash scripts into different directories, and let the Python code on each machine read and write its own directory, will that solve the problem? Or is there some other reason for the slowdown?
Thanks a lot for any suggestions! Let me know if I didn't explain anything clearly.
Some additional info:
My home directory is at /home/my_group/my_home; here is the mount info for it:
:/vol/my_group on /home/my_group type nfs (rw,nosuid,nodev,noatime,tcp,timeo=600,retrans=2,rsize=65536,wsize=65536,addr=...)
By constantly checking the q directory, I mean a Python loop like this:
while True:
    if the 'start' file exists:
        find the scripts and execute them as I mentioned above
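(For reference, a runnable sketch of that loop for worker 1 might look like the following. This is only an illustration of the handshake described above, not the original code; the q/worker1.start and q/worker1.success names simply follow the naming convention mentioned earlier, and clearing the success flag at the end is an assumption.)

import glob
import os
import time

while True:
    # wait until the Ruby side signals that new scripts are ready
    if not os.path.exists('q/worker1.start'):
        time.sleep(5)
        continue

    # launch every script assigned to this worker
    for job in glob.glob('q/*.worker1.sh'):
        os.system('sh ' + job + ' &')

    # the bash script creates q/worker1.success when the command finishes
    while not os.path.exists('q/worker1.success'):
        time.sleep(5)

    os.remove('q/worker1.success')   # assumed: clear the flag for the next round
    os.remove('q/worker1.start')     # tells the Ruby side the job finished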
Comments (2)
While this isn't directly what you asked, you should really, really consider fixing your problem at this level: using some sort of shared message queue is likely to be a lot simpler to manage and debug than relying on the locking semantics of a particular networked filesystem.
The simplest solution to set up and run, in my experience, is Redis on the machine currently running the Ruby script that creates the jobs. It should literally be as simple as downloading the source, compiling it and starting it up. Once the Redis server is up and running, you change your code to append the computation commands to one or more Redis lists. In Ruby you would use the redis-rb library like this:
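(The answer's original snippet is not included on this page; a minimal sketch of what it likely showed, assuming Redis runs on localhost and using a hypothetical list name "jobs" and an example command string, is:)

require 'redis'

redis = Redis.new(:host => "localhost")   # assumes Redis runs on this machine

# push each computation command onto a shared work queue
redis.rpush "jobs", "sh /home/my_group/my_home/q/task.worker1.sh"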
If the computations need to be handled by certain machines, use a list per-machine like this:
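(Again, the original snippet is missing; a sketch with one hypothetical list per worker machine, e.g. jobs:worker1 through jobs:worker20, might be:)

require 'redis'

redis = Redis.new(:host => "localhost")

# one list per worker machine: jobs:worker1 .. jobs:worker20
(1..20).each do |i|
  redis.rpush "jobs:worker#{i}", "sh ~/q/task.worker#{i}.sh"
end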
Then in your Python code, you can use redis-py to connect to the Redis server and pull jobs off the list like so:
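(The referenced snippet is not shown here either; a sketch using redis-py, assuming the server runs on the machine with the Ruby script and the same hypothetical jobs:worker1 list, could be:)

import os
import redis

# placeholder host name for the machine running the Redis server
r = redis.Redis(host='ruby-host', port=6379, db=0)

while True:
    # blpop blocks until a job appears on this worker's list
    _key, command = r.blpop('jobs:worker1')
    os.system(command.decode())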
Of course, you could just as easily pull jobs off the queue and execute them in Ruby:
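(A corresponding Ruby sketch, under the same assumptions, might be:)

require 'redis'

redis = Redis.new(:host => "ruby-host")   # placeholder host name

loop do
  # blocks until a job is available on this worker's list, then runs it through the shell
  _list, command = redis.blpop("jobs:worker1")
  system(command)
end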
With some more details about the needs of the computation and the environment it's running in, it would be possible to recommend even simpler approaches to managing it.
Try a while loop? If that doesn't work, on the python side try using a TRY statement like so:
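(The code this answer refers to is not included on the page. One plausible reading is wrapping the job launch in try/except so a failure on one script does not kill the polling loop, e.g.:)

import glob
import os

try:
    for job in glob.glob('q/*.worker1.sh'):
        os.system('sh ' + job + ' &')
except OSError as err:
    # log the problem and let the outer while loop retry later
    print('could not launch jobs: %s' % err)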