mpi4py with processes and threads

Posted on 2024-10-31 05:55:34

Hi, this is a pretty specific question, so I hope StackOverflow is meant for all programming languages and not just JavaScript/HTML.

I am writing a parallel program on MPICH2 (a popular Message Passing Interface implementation). My program is written in Python, so I use the mpi4py Python bindings. MPI is best suited to situations with no shared memory; it is therefore not ideal for multicore programming. To use all 4 cores on each node of my 5-node cluster, I am additionally using threads. However, I have noticed that using threads actually slows my simulation down. My program is several tens of thousands of lines of code, so I cannot post it all, but here is the snippet that is causing problems:

from threading import Thread
...
threadIndeces=[[0,10],[11,20],[21,30],[31,40]] #subset for each thread
for indeces in threadIndeces:
  t=Thread(target=foo,args=(indeces,))
  t.start()

Also, I make sure to join the threads later. If I run it with no threads, and just call foo with all the indeces, it is about 10-15x faster. When I time the multithreaded version, creating the threads with t=Thread(target=foo,args=(indeces,)) takes around 0.05 seconds, the join similarly takes 0.05 seconds, but the t.start() call takes a whopping 0.2 seconds.
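
A minimal sketch of how those per-call timings might be measured (foo here is just a placeholder for the real per-thread work, and the numbers in the comments are the ones reported above):

import time
from threading import Thread

def foo(indeces):
    pass  # placeholder for the real per-thread work

threadIndeces=[[0,10],[11,20],[21,30],[31,40]]
threads=[]

for indeces in threadIndeces:
    t0=time.time()
    t=Thread(target=foo,args=(indeces,))
    print("create: %.3f s" % (time.time()-t0))   # ~0.05 s reported above
    t0=time.time()
    t.start()
    print("start:  %.3f s" % (time.time()-t0))   # ~0.2 s reported above
    threads.append(t)

t0=time.time()
for t in threads:
    t.join()
print("join:   %.3f s" % (time.time()-t0))       # ~0.05 s reported above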

Is start() an expensive call? Should I be changing my approach? I thought about keeping a pool of threads rather than creating new ones every iteration, but it does not seem like t=Thread(target=foo,args=(indeces,)) is what's causing the slowdown.
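
For what it's worth, such a reusable pool could look roughly like the sketch below, using the standard-library concurrent.futures module; foo, threadIndeces and num_steps are placeholders for the actual simulation:

from concurrent.futures import ThreadPoolExecutor

def foo(indeces):
    pass  # placeholder for the real per-chunk work

threadIndeces=[[0,10],[11,20],[21,30],[31,40]]
num_steps=1000  # illustrative number of simulation iterations

# create the pool once, outside the main loop, and reuse it every iteration
with ThreadPoolExecutor(max_workers=4) as pool:
    for step in range(num_steps):
        futures=[pool.submit(foo,indeces) for indeces in threadIndeces]
        for f in futures:
            f.result()  # wait for every chunk, analogous to joining the threads

Note, though, that as the answer below points out, threads in CPython only help if foo spends most of its time in C extensions or I/O that release the GIL.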

Also, in case anyone wants to know the complexity of foo, here is one of the functions that gets called i times for the indeces every iteration (non-discrete time):

def HD_training_firing_rate(HD_cell):
    """During training, the firing rate is governed by the difference between the 
       current heading direction and the preferred heading direction. This is made
       to resemble a Gaussian distribution
    """
    # fabs, exp and direction are module-level globals (fabs and exp are
    # presumably math.fabs and math.exp bound at import time)
    global fabs
    global exp
    global direction

    #loop over twice due to concurrent CW and CCW HD training
    for c in [0,1]:
        d=direction[c]
        dp=HD_cell.dp  #directional preference
        s_d=20.0  #standard deviation
        s_i=min(fabs(dp-d),360-fabs(dp-d)) #circular deviation from preferred dir.

        HD_cell.r[c]=exp(-s_i*s_i/(2*s_d*s_d))  #normal distribution

Comments (1)

虚拟世界 2024-11-07 05:55:34

If you need threads, Python may not be your best option, due to the Global Interpreter Lock, which prevents true concurrency. See also Dave Beazley's disturbing talk.
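
A quick way to see the effect of the GIL on pure-Python, CPU-bound work is a toy comparison like the one below (illustrative only; the exact numbers depend on the machine):

import time
from threading import Thread

def burn(n):
    s=0
    for i in range(n):
        s+=i*i  # pure-Python arithmetic never releases the GIL
    return s

N=5000000

t0=time.time()
burn(N); burn(N)
print("serial   %.2f s" % (time.time()-t0))

t0=time.time()
threads=[Thread(target=burn,args=(N,)) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print("threaded %.2f s" % (time.time()-t0))  # typically no faster than serial on CPython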

You might be better off just running 20 processes, to keep all 4 cores on each of your 5 nodes busy, and using MPI for all communication.
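
As a rough illustration of that suggestion, an all-MPI version might distribute the index chunks across ranks instead of threads, along the lines of the sketch below (launched with something like mpiexec -n 20 python sim.py; foo again stands in for the real work):

from mpi4py import MPI

def foo(indeces):
    pass  # placeholder for the per-chunk work from the question

comm=MPI.COMM_WORLD
rank=comm.Get_rank()
size=comm.Get_size()                 # e.g. 20 ranks = 5 nodes x 4 cores

all_indeces=list(range(41))
my_indeces=all_indeces[rank::size]   # each rank takes a strided share of the work

result=foo(my_indeces)

# exchange results with messages instead of shared memory
results=comm.gather(result, root=0)
if rank==0:
    print("gathered results from", size, "ranks")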

Python incurs a lot of overhead on big iron; you may want to think about C or C++ (or dare I say Fortran?) if you're really committed to a joint threads/message-passing approach.
