All threads hang when a busy task runs in one thread

Posted on 2024-11-14 00:44:32

I have a multi-threaded Python application that spawns threads to do various tasks. It has been working great for months, but recently I've run into a problem.

One of the threads creates a subprocess.Popen object that runs an intensive data copy command.

import os
import subprocess

copy = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                        preexec_fn=os.setsid, shell=False, close_fds=True)
if copy.wait():  # a non-zero exit status means the copy failed
    raise Exception("Unable to copy!")

While the copy command is running, the entire application eventually bogs down, and none of my other threads run for minutes at a time. Once the copy finishes, everything resumes where it left off.

I'm trying to figure out how to prevent this from happening. My best theory at the moment is that it has something to do with the way my kernel schedules processes. I added the call to setsid() so that the copy process would be scheduled separately from the main Python app, but this had no effect.

I'm assuming all that copy.wait() does is call waitpid(). Is it possible that the call takes a long time and that the calling thread holds the GIL for the whole duration? If so, how do I prevent or deal with this? What can I do to debug this further?

Comments (1)

柏林苍穹下 2024-11-21 00:44:32

copy.wait() holding the GIL was my first suspicion too. However, this doesn't appear to be the case on my system: a wait() call doesn't prevent other threads from making progress.

You are right that copy.wait() eventually ends up in os.waitpid(). The latter looks like this on my Linux system:

PyDoc_STRVAR(posix_waitpid__doc__,
"waitpid(pid, options) -> (pid, status)\n\n\
Wait for completion of a given child process.");

static PyObject *
posix_waitpid(PyObject *self, PyObject *args)
{
    pid_t pid;
    int options;
    WAIT_TYPE status;
    WAIT_STATUS_INT(status) = 0;

    if (!PyArg_ParseTuple(args, PARSE_PID "i:waitpid", &pid, &options))
        return NULL;
    Py_BEGIN_ALLOW_THREADS
    pid = waitpid(pid, &status, options);
    Py_END_ALLOW_THREADS
    if (pid == -1)
        return posix_error();

    return Py_BuildValue("Ni", PyLong_FromPid(pid), WAIT_STATUS_INT(status));
}

The Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS pair clearly releases the GIL while the thread is blocked in POSIX waitpid().
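To check the same thing on your own system without reading the C source, a small experiment can help. The following is only a sketch, assuming Python 3 and that sys.executable can launch a child interpreter; the function name count_while_waiting and the 10 ms tick interval are made up for illustration. A background thread increments a counter while the main thread is blocked in Popen.wait(); if wait() held the GIL, the counter would barely move.

```python
import subprocess
import sys
import threading
import time


def count_while_waiting(child_seconds=1.0):
    """Return (child exit status, ticks counted by a thread during wait())."""
    ticks = 0
    stop = threading.Event()

    def ticker():
        # Incrementing the counter needs the GIL, so this loop only makes
        # progress if the thread blocked in wait() has released it.
        nonlocal ticks
        while not stop.is_set():
            ticks += 1
            time.sleep(0.01)

    # The child just sleeps; it stands in for the long-running copy command.
    child = subprocess.Popen(
        [sys.executable, "-c", f"import time; time.sleep({child_seconds})"]
    )
    worker = threading.Thread(target=ticker)
    worker.start()
    returncode = child.wait()  # blocks in waitpid() on POSIX systems
    stop.set()
    worker.join()
    return returncode, ticks


if __name__ == "__main__":
    status, ticks = count_while_waiting(1.0)
    print(f"child exit status: {status}, ticks during wait(): {ticks}")
```

A tick count in the dozens or more suggests that wait() is not what starves the other threads, and that the cause lies elsewhere.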

I would try attaching gdb to the Python process while it is hung to see what the threads are doing. Perhaps this would provide some ideas.

Edit: this is what a multi-threaded Python process looks like in gdb:
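If attaching gdb is not convenient, a Python-level view of the threads can be a useful complement. This is a different technique from the gdb approach, sketched here under the assumption of Python 3; the helper names format_thread_stacks and install_signal_dump are hypothetical. The first helper uses sys._current_frames() and needs the GIL to run, so it only helps if some Python code still gets scheduled. The second uses the standard faulthandler module, which writes the dump from its signal handler and may therefore be the better choice when the interpreter itself appears stuck.

```python
import faulthandler
import signal
import sys
import threading
import traceback


def format_thread_stacks():
    """Return a text report with the current Python stack of every thread."""
    names = {t.ident: t.name for t in threading.enumerate()}
    parts = []
    for ident, frame in sys._current_frames().items():
        parts.append(f"Thread {names.get(ident, 'unknown')} ({hex(ident)}):\n")
        parts.extend(traceback.format_stack(frame))
    return "".join(parts)


def install_signal_dump():
    """Dump all thread stacks to stderr when the process receives SIGUSR1.

    Returns False on platforms without faulthandler.register or SIGUSR1
    (for example Windows).
    """
    if hasattr(faulthandler, "register") and hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, all_threads=True)
        return True
    return False


if __name__ == "__main__":
    install_signal_dump()
    print(format_thread_stacks())
```

With install_signal_dump() called at startup, sending SIGUSR1 to the process (for example with kill -USR1 followed by the process id) should print every thread's Python stack to stderr.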

(gdb) info threads
  11 Thread 0x7f82c6462700 (LWP 30865)  0x00007f82c7676b50 in sem_wait () from /lib/libpthread.so.0
  10 Thread 0x7f82c5c61700 (LWP 30866)  0x00007f82c7676b50 in sem_wait () from /lib/libpthread.so.0
  9 Thread 0x7f82c5460700 (LWP 30867)  0x00007f82c7676b50 in sem_wait () from /lib/libpthread.so.0
  8 Thread 0x7f82c4c5f700 (LWP 30868)  0x00007f82c7676b50 in sem_wait () from /lib/libpthread.so.0
  7 Thread 0x7f82c445e700 (LWP 30869)  0x00000000004a3c37 in PyEval_EvalFrameEx ()
  6 Thread 0x7f82c3c5d700 (LWP 30870)  0x00007f82c7676dcd in sem_post () from /lib/libpthread.so.0
  5 Thread 0x7f82c345c700 (LWP 30871)  0x00007f82c7676b50 in sem_wait () from /lib/libpthread.so.0
  4 Thread 0x7f82c2c5b700 (LWP 30872)  0x00007f82c7676b50 in sem_wait () from /lib/libpthread.so.0
  3 Thread 0x7f82c245a700 (LWP 30873)  0x00007f82c7676b50 in sem_wait () from /lib/libpthread.so.0
  2 Thread 0x7f82c1c59700 (LWP 30874)  0x00007f82c7676b50 in sem_wait () from /lib/libpthread.so.0
* 1 Thread 0x7f82c7a7c700 (LWP 30864)  0x00007f82c7676b50 in sem_wait () from /lib/libpthread.so.0

Here, all threads but two are waiting for the GIL. A typical stack trace goes like this:

(gdb) thread 11
[Switching to thread 11 (Thread 0x7f82c6462700 (LWP 30865))] #0  0x00007f82c7676b50 in sem_wait () from /lib/libpthread.so.0
(gdb) where
#0  0x00007f82c7676b50 in sem_wait () from /lib/libpthread.so.0
#1  0x00000000004d4498 in PyThread_acquire_lock ()
#2  0x00000000004a2f3f in PyEval_EvalFrameEx ()
#3  0x00000000004a9671 in PyEval_EvalCodeEx ()
...

You can figure out which thread is which by printing hex(t.ident) in your Python code, where t is a threading.Thread object. On my system, this matches the thread ids shown by gdb (0x7f82c6462700 and so on).
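A minimal sketch of that mapping, assuming Python 3; the function name describe_threads and the thread name copy-worker are made up for illustration. It lists hex(t.ident) for every live thread and, where available (Python 3.8 and later), t.native_id, which on Linux should correspond to the LWP number that gdb prints next to each thread.

```python
import threading


def describe_threads():
    """Return (name, hex ident, native id) for every live thread."""
    rows = []
    for t in threading.enumerate():
        # native_id exists on Python 3.8+; fall back to None on older versions.
        native_id = getattr(t, "native_id", None)
        rows.append((t.name, hex(t.ident), native_id))
    return rows


if __name__ == "__main__":
    release = threading.Event()
    # A named worker that simply blocks until released.
    worker = threading.Thread(target=release.wait, name="copy-worker")
    worker.start()
    for name, ident, native_id in describe_threads():
        print(f"{name:<15} ident={ident} native_id={native_id}")
    release.set()
    worker.join()
```

Giving each thread a descriptive name makes it much easier to tell from the listing which gdb thread is the one running the copy.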
