Confusing Parallel Python problem - TRANSPORT_SOCKET_TIMEOUT

Posted 2024-09-30 21:15:04

The following code doesn't appear to work properly for me. It requires starting a ppserver on another computer on your network, for example with the following command:

ppserver.py -r -a -w 4

Once this server is started, on my machine I run this code:

import pp
import time

# "*" auto-discovers ppservers on the network; ncpus=0 disables local workers
# so all jobs are sent to the remote server.
job_server = pp.Server(ppservers = ("*",))
job_server.set_ncpus(0)

def addOneBillion(x):
    # Long-running busy loop, then log completion to a status file.
    r = x
    for i in xrange(10**9):
        r += 1
    f = open('/home/tomb/statusfile.txt', 'a')
    f.write('finished at ' + time.asctime() + ' for job with input ' + str(x) + '\n')
    f.close()
    return r

jobs = []
jobs.append(job_server.submit(addOneBillion, (1,), (), ("time",)))
jobs.append(job_server.submit(addOneBillion, (2,), (), ("time",)))
jobs.append(job_server.submit(addOneBillion, (3,), (), ("time",)))

for job in jobs:
    print job()
print 'done'

The odd part:
Watching /home/tomb/statusfile.txt, I can see it being written to several times, as though the function were being run repeatedly. I've watched this continue for over an hour without ever seeing a job() return.

Odder:
If I change the number of iterations in the addOneBillion definition to 10**8, the function runs just once and returns a result as expected!

Seems like some kind of race condition? Everything works fine when using only local cores. This happens with pp v1.6.0 and 1.5.7.

Update: At around 775,000,000 iterations I get inconsistent results: two of the jobs repeat once, one finishes the first time.

Update, a week later: I've written my own parallel processing module to get around this, and will avoid Parallel Python in the future unless someone figures this out; I'll get around to looking at it some more (actually diving into the source code) at some point.

Update, months later: No remaining hard feelings, Parallel Python. I plan to move back as soon as I have time to migrate my application. Title edited to reflect the solution.


2 Answers

李白 2024-10-07 21:15:04

Answer from Bagira of the Parallel Python forum:

How long does the calculation of every job take? Have a look at the variable TRANSPORT_SOCKET_TIMEOUT in /usr/local/lib/python2.6/dist-packages/pptransport.py.

Maybe your job takes longer than the time in the variable above. Increase the value of it and try.

Turns out this was exactly the problem. In my application I'm using PP as a batch scheduler for jobs that can take several minutes, so I needed to raise this value (the default was 30 seconds).
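The confirmed fix was simply to raise the constant in pptransport.py itself. As a rough alternative sketch, and only under the assumption that pp looks up the module-level constant when it opens a connection rather than copying it at import time, one might try overriding it before creating the server:

# Assumption: pp reads pptransport.TRANSPORT_SOCKET_TIMEOUT dynamically, so
# overriding the module-level value takes effect for later connections.
# If it does not, edit the constant in pptransport.py as the answer suggests.
import pptransport
pptransport.TRANSPORT_SOCKET_TIMEOUT = 600   # seconds; the default was 30

import pp
job_server = pp.Server(ppservers = ("*",))   # long jobs should no longer hit the 30 s timeout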

黑凤梨 2024-10-07 21:15:04

It may be that the library deliberately allows duplicates: when some nodes lag behind, there is a long tail of remaining tasks, and by re-running those tasks on other nodes it can route around the slow ones, with the expectation that you take whichever result finishes first. You can work around this by including a unique id with each task and accepting only the first result returned for each id.
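A minimal sketch of that idea, assuming each job is changed to return its input alongside its result so the input can double as a unique id; the dict-based filtering below is illustrative and not part of the pp API:

# Sketch: have addOneBillion return (x, r) so x serves as a task id,
# then keep only the first result seen for each id.
results = {}
for job in jobs:
    task_id, value = job()        # job() blocks until a result arrives
    if task_id not in results:    # ignore any duplicate completions
        results[task_id] = value

for task_id in sorted(results):
    print results[task_id]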
