Monitor and kill runaway processes using 100% IO?

Posted on 2024-08-31 18:46:02

I have a few processes that have to run at high priority (chrt 98). Occasionally one will hard-lock and peg one core at 100% (not a huge deal), but more importantly it will use all the IO on the system, so much that it's impossible to log into the machine via ssh to kill it, or to perform any task on the machine that isn't already loaded into RAM. If I happen to have something like htop already running, I am able to end the process fine. Is there any type of utility or way to monitor for this type of runaway process and kill anything that uses 100% of system IO for more than X amount of time? Thanks!


Comments (3)

过度放纵 2024-09-07 18:46:02

Can't you start the program with nice (and with a lower priority)? This way at least you should be able to ssh into the box and kill it easily.

The better solution would of course be to fix the behaviour of the offending process (details needed).

This serverfault thread also seems to contain what you ask for specifically.
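The nice suggestion above can be sketched in a few lines of Ruby. This is only an illustration: `run_niced` is a hypothetical helper name, not part of any tool mentioned in this thread. Note that niceness only influences the normal scheduling policy, so it does not apply to a process that is placed in a real-time class with chrt.

```ruby
# Hypothetical helper: run a command at a given niceness and wait for it.
# Raising niceness (lowering priority) needs no special privileges.
def run_niced(niceness, *cmd, out: $stdout)
  pid = fork do
    # A "who" of 0 means "the calling process", i.e. the forked child.
    Process.setpriority(Process::PRIO_PROCESS, 0, niceness)
    # Replace the child with the target command, sending stdout to `out`.
    exec(*cmd, out: out)
  end
  Process.wait(pid)
  $? # Process::Status of the finished command
end
```

Something like `run_niced(10, "my_program", "--some-flag")` (both names are placeholders) would then start the program at niceness 10, leaving sshd and an interactive shell at the default niceness of 0.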

日暮斜阳 2024-09-07 18:46:02

Assuming that it's disk IO that the app is consuming, can you just move the filesystems it's accessing onto separate disks? That way you'll have IO to spare on the disks which the OS is installed on, and should be able to log in and manage (i.e. kill!) the process.

酒与心事 2024-09-07 18:46:02

As another poster said, running your process with nice is the way to go, but you did mention that you want to run it at a high priority, which is odd. Be aware that if you're running a process at the highest priority and it's pegged, your monitoring system might not even be able to kill it, unless your monitor is at a higher priority still. Anyway...

god, as well as several other process management tools, can easily kill a process if it's misbehaving in any of several ways. The config looks like this: you set checks at a particular interval, and then you can say "after five checks, nuke it if it's been above 98% CPU usage consistently":

  restart.condition(:cpu_usage) do |c|
    c.above = 98.percent
    c.times = 5
  end

Another, different take that you might have a look at is chpst from the runit system - it allows you to elegantly set bounds on things (but for CPU limiting, nice is still the tool I'd reach for first).
