What is an uninterruptible process?
Sometimes, when a program I write on Linux crashes due to a bug of some sort, it becomes an uninterruptible process and keeps running until I restart my computer (even if I log out). My questions are:
- What causes a process to become uninterruptible?
- How do I stop that from happening?
- This is probably a dumb question, but is there any way to interrupt it without restarting my computer?
An uninterruptible process is a process which happens to be in a system call (kernel function) that cannot be interrupted by a signal.
To understand what that means, you need to understand the concept of an interruptible system call. The classic example is read(). This system call can take a long time (seconds), since it may involve spinning up a hard drive or moving heads. During most of this time, the process is sleeping, blocked on the hardware.
While the process is sleeping in the system call, it can receive a Unix asynchronous signal (say, SIGTERM). The system call then returns early (typically failing with EINTR), the signal handler runs, and, if the process is still running, it sees the return value of the system call and can make the same call again.
Returning early from the system call enables the user space code to immediately alter its behavior in response to the signal. For example, terminating cleanly in reaction to SIGINT or SIGTERM.
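As a rough illustration of a signal cutting a blocking read() short, the following Python sketch blocks on an empty pipe and lets SIGALRM interrupt the call. The names read_with_timeout, Interrupted, and demo are made up for this example. Python normally retries an interrupted system call on its own, so the handler raises an exception to make the early return visible. The sketch assumes a Unix system and must run in the main thread.

```python
import os
import signal


class Interrupted(Exception):
    """Raised by the signal handler to abandon the blocked read (hypothetical name)."""


def read_with_timeout(fd, nbytes, seconds):
    """Block in read() on fd, but give up if SIGALRM arrives first.

    Returns the bytes read, or None if the signal interrupted the call.
    """
    def on_alarm(signum, frame):
        raise Interrupted()

    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        # With nothing to read, the process sleeps interruptibly in the kernel.
        return os.read(fd, nbytes)
    except Interrupted:
        return None
    finally:
        # Cancel the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, old_handler)


def demo():
    """Read once from an empty pipe (interrupted), then once with data ready."""
    r, w = os.pipe()
    try:
        first = read_with_timeout(r, 16, 0.2)
        os.write(w, b"ready")
        second = read_with_timeout(r, 16, 0.2)
        return first, second
    finally:
        os.close(r)
        os.close(w)
```

Calling demo() performs one read that the signal interrupts and one that completes normally. A read stuck in an uninterruptible sleep would not react to the alarm in this way.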
On the other hand, some system calls are not allowed to be interrupted in this way. If the system call stalls for some reason, the process can remain in this unkillable state indefinitely.
LWN ran a nice article that touched on this topic in July.
To answer the original question:
How to prevent this from happening: figure out which driver is causing you trouble, and either stop using it, or become a kernel hacker and fix it.
How to kill an uninterruptible process without rebooting: somehow make the system call terminate. Frequently the most effective manner to do this without hitting the power switch is to pull the power cord. You can also become a kernel hacker and make the driver use TASK_KILLABLE, as explained in the LWN article.
When a process is in user mode, it can be interrupted at any time (switching to kernel mode). When the kernel returns to user mode, it checks whether any signals are pending (including the ones used to kill the process, such as SIGTERM and SIGKILL). This means a process can be killed only on return to user mode.
The reason a process cannot be killed in kernel mode is that doing so could corrupt the kernel structures used by all the other processes on the same machine (the same way killing a thread can corrupt data structures used by other threads in the same process).
When the kernel needs to do something which could take a long time (waiting on a pipe written by another process or waiting for the hardware to do something, for instance), it puts the process to sleep by marking it as sleeping and calling the scheduler to switch to another process (if there is no non-sleeping process, it switches to a "dummy" process which tells the CPU to slow down a bit and sits in a loop, the idle loop).
If a signal is sent to a sleeping process, it has to be woken up before it will return to user space and thus process the pending signal. Here we have the difference between the two main types of sleep:
- TASK_INTERRUPTIBLE, the interruptible sleep. If a task is marked with this flag, it is sleeping, but can be woken by signals. This means the code which marked the task as sleeping is expecting a possible signal, and after it wakes up will check for it and return from the system call. After the signal is handled, the system call can potentially be automatically restarted (and I won't go into details on how that works).
- TASK_UNINTERRUPTIBLE, the uninterruptible sleep. If a task is marked with this flag, it is not expecting to be woken up by anything other than whatever it is waiting for, either because it cannot easily be restarted, or because programs are expecting the system call to be atomic. This can also be used for sleeps known to be very short.
- TASK_KILLABLE (mentioned in the LWN article linked to by ddaa's answer) is a new variant.
This answers your first question. As to your second question: you can't avoid uninterruptible sleeps; they are a normal thing (one happens, for instance, every time a process reads from or writes to the disk). However, they should last only a fraction of a second. If they last much longer, it usually means a hardware problem (or a device driver problem, which looks the same to the kernel), where the device driver is waiting for the hardware to do something which will never happen. It can also mean you are using NFS and the NFS server is down (it is waiting for the server to recover; you can also use the "intr" option to avoid the problem).
Finally, the reason you cannot recover is the same reason the kernel waits until return to user mode to deliver a signal or kill the process: it would potentially corrupt the kernel's data structures (code waiting on an interruptible sleep can receive an error which tells it to return to user space, where the process can be killed; code waiting on an uninterruptible sleep is not expecting any error).
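The sleep states described above are visible from user space: the third field of /proc/<pid>/stat is a one-letter state code, where S is an interruptible sleep, D an uninterruptible sleep, and Z a zombie. The following is a minimal sketch for listing tasks currently in state D. The function names and the proc_root parameter are assumptions made for illustration, not part of any standard tool.

```python
import os


def parse_stat_state(stat_line):
    """Return the one-letter state code from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the LAST closing parenthesis.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def find_uninterruptible(proc_root="/proc"):
    """Return [(pid, comm)] for every process currently in state D."""
    stuck = []
    if not os.path.isdir(proc_root):
        return stuck
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        stat_path = os.path.join(proc_root, entry, "stat")
        try:
            with open(stat_path, errors="replace") as f:
                line = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        if parse_stat_state(line) == "D":
            comm = line[line.index("(") + 1:line.rindex(")")]
            stuck.append((int(entry), comm))
    return stuck
```

On a healthy system find_uninterruptible() is usually empty or changes from one call to the next, since these sleeps are short. A process that stays in the list for a long time is the kind of stuck process discussed here.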
Uninterruptible processes are USUALLY waiting for I/O following a page fault.
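Consider this: a thread touches a page that is not currently in memory (for example, part of a demand-loaded executable or mmap'd file, or a page that has been swapped out), the kernel tries to load it, and the thread cannot continue until the page is available.
The process/task cannot be interrupted in this state, because it can't handle any signals; if it did, another page fault would happen and it would be back where it was.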
When I say "process", I really mean "task", which under Linux (2.6) roughly translates to "thread" which may or may not have an individual "thread group" entry in /proc
In some cases, it may be waiting for a long time. A typical example of this would be where the executable or mmap'd file is on a network filesystem where the server has failed. If the I/O eventually succeeds, the task will continue. If it eventually fails, the task will generally get a SIGBUS or something.
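To make page faults a little more concrete, here is a small sketch that counts the faults a process takes when it first touches freshly mapped memory. It uses an anonymous mapping, so the faults are normally cheap minor faults with no I/O behind them. The stall described above needs a fault that has to wait on a disk or a network server, which this example does not try to reproduce. The function names are made up for illustration, and the counts depend on the kernel and its configuration.

```python
import mmap
import resource


def fault_counts():
    """Return (minor, major) page fault counts for this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


def touch_pages(num_pages=64):
    """Map num_pages of anonymous memory, write to each page once,
    and return how many (minor, major) faults that caused."""
    page = mmap.PAGESIZE
    region = mmap.mmap(-1, num_pages * page)  # anonymous mapping, not file backed
    try:
        minor_before, major_before = fault_counts()
        for i in range(num_pages):
            region[i * page] = 1  # first write to the page triggers a fault
        minor_after, major_after = fault_counts()
    finally:
        region.close()
    return minor_after - minor_before, major_after - major_before
```

Each fault briefly puts the task into the kernel. When the missing page must come from a slow or dead device, that wait is where the task sits in uninterruptible sleep.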
To your third question:
I think you can kill the uninterruptible processes by running sudo kill -HUP 1. It will restart init without ending the running processes; after I ran it, my uninterruptible processes were gone.
If you are talking about a "zombie" process (which is designated as "zombie" in ps output), then this is a harmless record in the process list waiting for someone to collect its return code, and it can be safely ignored.
Could you please describe what an "uninterruptible process" is for you? Does it survive "kill -9" and happily chug along? If that is the case, then it's stuck on some syscall, which is stuck in some driver, and you are stuck with this process until you reboot (and sometimes it's better to reboot soon) or unload the relevant driver (which is unlikely to happen). You could try to use "strace" to find out where your process is stuck and avoid it in the future.
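Besides strace, Linux exposes /proc/<pid>/wchan, which holds the name of the kernel function a sleeping task is waiting in. This is a different technique from the strace suggestion above, sketched here under the assumption that the file is readable. On some kernels it only shows "0" for unprivileged users. The function name and the proc_root parameter are hypothetical.

```python
import os


def kernel_wait_location(pid, proc_root="/proc"):
    """Return the kernel symbol the process is sleeping in, or None.

    None means the file could not be read, or the kernel reported no
    wait location (empty or "0", e.g. the task is running or the
    information is hidden from this user).
    """
    path = os.path.join(proc_root, str(pid), "wchan")
    try:
        with open(path) as f:
            name = f.read().strip()
    except OSError:
        return None  # no such process, or permission denied
    if not name or name == "0":
        return None
    return name
```

For a process stuck in state D, the returned symbol often hints at the subsystem involved, such as a filesystem or a particular driver.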