Confounding parallel Python problem - TRANSPORT_SOCKET_TIMEOUT
The following code doesn't appear to work properly for me. It requires starting a ppserver on another computer on your network, for example with the following command:
ppserver.py -r -a -w 4
Once this server is started, on my machine I run this code:
import pp
import time

# Discover ppservers on the network via broadcast; use remote workers only
# (set_ncpus(0) disables local worker processes).
job_server = pp.Server(ppservers=("*",))
job_server.set_ncpus(0)

def addOneBillion(x):
    r = x
    for i in xrange(10**9):
        r += 1
    # Log each completion so repeated executions show up in the file.
    f = open('/home/tomb/statusfile.txt', 'a')
    f.write('finished at ' + time.asctime() + ' for job with input ' + str(x) + '\n')
    f.close()
    return r

jobs = []
jobs.append(job_server.submit(addOneBillion, (1,), (), ("time",)))
jobs.append(job_server.submit(addOneBillion, (2,), (), ("time",)))
jobs.append(job_server.submit(addOneBillion, (3,), (), ("time",)))

for job in jobs:
    print job()
print 'done'
The odd part:
Watching /home/tomb/statusfile.txt, I can see it getting written to several times, as though the function were being run several times. I've observed this continuing for over an hour without ever seeing a job() return.
Odder:
If I change the number of iterations in the addOneBillion definition to 10**8, the function runs just once and returns a result as expected!
Seems like some kind of race condition? Just using local cores works fine. This is with pp v 1.6.0 and 1.5.7.
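For comparison, a minimal sketch of the local-only run that behaves correctly; no ppservers tuple is passed, so everything stays on this machine's cores (pp auto-detects the core count by default):

import pp

def addOneBillion(x):
    r = x
    for i in xrange(10**9):
        r += 1
    return r

# No ppservers argument: jobs run only on local worker processes.
job_server = pp.Server()  # ncpus defaults to the number of local cores
jobs = [job_server.submit(addOneBillion, (n,), (), ()) for n in (1, 2, 3)]
for job in jobs:
    print job()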
Update: At around 775,000,000 iterations I get inconsistent results: two jobs repeat once, one finishes the first time.
Week-later update: I've written my own parallel processing module to get around this, and will avoid Parallel Python in the future unless someone figures this out - I'll get around to looking at it some more (actually diving into the source code) at some point.
Months-later update: No remaining hard feelings, Parallel Python. I plan to move back as soon as I have time to migrate my application. Title edited to reflect the solution.
Answer from Bagira of the Parallel Python forum:
Have a look at the TRANSPORT_SOCKET_TIMEOUT variable in pp's pptransport.py. If your tasks take longer than that timeout, increase its value and try again.
Turns out this was exactly the problem. In my application I'm using PP as a batch scheduler for jobs that can take several minutes, so I needed to adjust this (the default was 30 seconds).
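For reference, a minimal sketch of one way to raise that timeout without editing the installed file, assuming a pp 1.6.x layout where pptransport is importable as a top-level module and its TRANSPORT_SOCKET_TIMEOUT constant is read when connections are made (an illustration, not pp's official configuration API):

import pptransport
# Assumption: overriding the module-level constant before pp opens any
# connections raises the socket timeout used for subsequent transfers.
pptransport.TRANSPORT_SOCKET_TIMEOUT = 3600  # seconds; pick > longest job

import pp
job_server = pp.Server(ppservers=("*",))

Note that each remote ppserver.py process uses its own copy of pptransport.py, so the same change may be needed on every worker node.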
It may be that the library allows duplicates deliberately: when some nodes lag behind, there is a long tail of remaining tasks to complete, and by duplicating those tasks onto other nodes it can route around the slow ones - you simply take whichever copy finishes first. You can avoid counting a task twice by including a unique id with each task and accepting only the first result returned for each id.
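A minimal sketch of that accept-first-per-id idea, here combined with deliberately submitting each task twice (speculative duplicates). The tagged_job helper and polling loop are illustrative, and it assumes pp job objects expose the finished flag that pp 1.6's tasks carry:

import time
import pp

def tagged_job(task_id, x):
    # Return the id alongside the result so the collector can tell
    # duplicate runs of the same task apart from distinct tasks.
    r = x
    for i in xrange(10**8):
        r += 1
    return (task_id, r)

job_server = pp.Server(ppservers=("*",))

# Submit every task twice so one slow or stalled node cannot block completion.
pending = []
for n in (1, 2, 3):
    for copy in (0, 1):
        pending.append(job_server.submit(tagged_job, (n, n), (), ()))

# Accept only the first result returned for each id; the later copy is ignored.
first_results = {}
while len(first_results) < 3:
    for job in list(pending):
        if job.finished:
            pending.remove(job)
            task_id, result = job()
            if task_id not in first_results:
                first_results[task_id] = result
    time.sleep(0.1)
print first_results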