Confounding parallel Python problem - TRANSPORT_SOCKET_TIMEOUT
The following code doesn't appear to work properly for me. It requires starting a ppserver on another computer on your network, for example with the following command:
ppserver.py -r -a -w 4
Once this server is started, on my machine I run this code:
import pp
import time

# Discover ppservers on the network via broadcast; use remote workers only
# (set_ncpus(0) disables local worker processes).
job_server = pp.Server(ppservers=("*",))
job_server.set_ncpus(0)

def addOneBillion(x):
    r = x
    for i in xrange(10**9):
        r += 1
    # Log each completion so repeated executions show up in the file.
    f = open('/home/tomb/statusfile.txt', 'a')
    f.write('finished at ' + time.asctime() + ' for job with input ' + str(x) + '\n')
    f.close()
    return r

jobs = []
jobs.append(job_server.submit(addOneBillion, (1,), (), ("time",)))
jobs.append(job_server.submit(addOneBillion, (2,), (), ("time",)))
jobs.append(job_server.submit(addOneBillion, (3,), (), ("time",)))

for job in jobs:
    print job()
print 'done'
The odd part:
Watching /home/tomb/statusfile.txt, I can see it getting written to several times, as though the function were being run several times. I've observed this continuing for over an hour without ever seeing a job() return.
Odder:
If I change the number of iterations in the addOneBillion definition to 10**8, the function runs just once and returns a result as expected!
Seems like some kind of race condition? Just using local cores works fine. This is with pp v 1.6.0 and 1.5.7.
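For comparison, a minimal sketch of the local-only run that behaves correctly; no ppservers tuple is passed, so everything stays on this machine's cores (pp auto-detects the core count by default):

import pp

def addOneBillion(x):
    r = x
    for i in xrange(10**9):
        r += 1
    return r

# No ppservers argument: jobs run only on local worker processes.
job_server = pp.Server()  # ncpus defaults to the number of local cores
jobs = [job_server.submit(addOneBillion, (n,), (), ()) for n in (1, 2, 3)]
for job in jobs:
    print job()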
Update: At around 775,000,000 iterations I get inconsistent results: two jobs repeat once, one finishes the first time.
Week-later update: I've written my own parallel processing module to get around this, and will avoid Parallel Python in the future unless someone figures this out - I'll get around to looking at it some more (actually diving into the source code) at some point.
Months-later update: No remaining hard feelings, Parallel Python. I plan to move back as soon as I have time to migrate my application. Title edited to reflect the solution.
Answer from Bagira of the Parallel Python forum:
Have a look at the TRANSPORT_SOCKET_TIMEOUT variable in pp's pptransport.py. If your tasks take longer than that timeout, increase its value and try again.
Turns out this was exactly the problem. In my application I'm using PP as a batch scheduler for jobs that can take several minutes, so I needed to adjust this (the default was 30 seconds).
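For reference, a minimal sketch of one way to raise that timeout without editing the installed file, assuming a pp 1.6.x layout where pptransport is importable as a top-level module and its TRANSPORT_SOCKET_TIMEOUT constant is read when connections are made (an illustration, not pp's official configuration API):

import pptransport
# Assumption: overriding the module-level constant before pp opens any
# connections raises the socket timeout used for subsequent transfers.
pptransport.TRANSPORT_SOCKET_TIMEOUT = 3600  # seconds; pick > longest job

import pp
job_server = pp.Server(ppservers=("*",))

Note that each remote ppserver.py process uses its own copy of pptransport.py, so the same change may be needed on every worker node.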
It may be that the library allows duplicates deliberately: when some nodes lag behind, there is a long tail of remaining tasks to complete, and by duplicating those tasks onto other nodes it can route around the slow ones - you simply take whichever copy finishes first. You can avoid counting a task twice by including a unique id with each task and accepting only the first result returned for each id.
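A minimal sketch of that accept-first-per-id idea, here combined with deliberately submitting each task twice (speculative duplicates). The tagged_job helper and polling loop are illustrative, and it assumes pp job objects expose the finished flag that pp 1.6's tasks carry:

import time
import pp

def tagged_job(task_id, x):
    # Return the id alongside the result so the collector can tell
    # duplicate runs of the same task apart from distinct tasks.
    r = x
    for i in xrange(10**8):
        r += 1
    return (task_id, r)

job_server = pp.Server(ppservers=("*",))

# Submit every task twice so one slow or stalled node cannot block completion.
pending = []
for n in (1, 2, 3):
    for copy in (0, 1):
        pending.append(job_server.submit(tagged_job, (n, n), (), ()))

# Accept only the first result returned for each id; the later copy is ignored.
first_results = {}
while len(first_results) < 3:
    for job in list(pending):
        if job.finished:
            pending.remove(job)
            task_id, result = job()
            if task_id not in first_results:
                first_results[task_id] = result
    time.sleep(0.1)
print first_results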