How can I reliably detect resource-consumption anomalies?

This question is about a whole class of similar problems, but I'll ask it as a concrete example.

I have a server with a file system whose contents fluctuate. I need to monitor the available space on this file system to ensure that it doesn't fill up. For the sake of argument, let's suppose that if it fills up, the server goes down.

It doesn't really matter what it is -- it might, for example, be a queue of "work".

During "normal" operation, the available space varies within "normal" limits, but there may be pathologies:

  • Some other (possibly external) component that adds work may run out of control
  • Some component that removes work seizes up, but remains undetected

The statistical characteristics of the process are basically unknown.

What I'm looking for is an algorithm that takes, as input, timed periodic measurements of the available space (alternative suggestions for input are welcome), and produces, as output, an alarm when things are "abnormal" and the file system is "likely to fill up". It is obviously important to avoid false negatives, but almost as important to avoid false positives, so as not to numb the brain of the sysadmin who gets the alarm.

I appreciate that there are alternative solutions like throwing more storage space at the underlying problem, but I have actually experienced instances where 1000 times wasn't enough.

Algorithms which consider stored historical measurements are fine, although on-the-fly algorithms which minimise the amount of historic data are preferred.
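
To make the input/output contract concrete, here is the sort of constant-memory, on-the-fly detector I have in mind. This is purely a sketch with an illustrative smoothing factor and alarm threshold; the class and parameter names are my own invention:

    import math

    class FreeSpaceMonitor:
        """Constant-memory detector for periodic free-space readings.

        Tracks an exponentially weighted mean and variance of free space
        and flags a reading that falls far below the smoothed baseline.
        alpha and k are illustrative, not tuned values.
        """

        def __init__(self, alpha=0.05, k=4.0):
            self.alpha = alpha   # smoothing factor; smaller = longer memory
            self.k = k           # alarm threshold, in standard deviations
            self.mean = None
            self.var = 0.0

        def update(self, free_bytes):
            """Feed one measurement; return True if it looks anomalous."""
            if self.mean is None:        # first sample just initialises
                self.mean = float(free_bytes)
                return False
            diff = free_bytes - self.mean
            alarm = self.var > 0 and diff < -self.k * math.sqrt(self.var)
            # Standard exponentially weighted updates of mean and variance.
            incr = self.alpha * diff
            self.mean += incr
            self.var = (1 - self.alpha) * (self.var + diff * incr)
            return alarm

The attraction is that it keeps O(1) state per filesystem; the obvious weakness is that, by itself, it would fire on a short harmless burst as readily as on a genuine drain.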


I have accepted Frank's answer, and am now going back to the drawing-board to study his references in depth.

There are three cases, I think, of interest, not in order:

  1. The "Harrods' Sale has just started" scenario: a peak of activity that at one-second resolution is "off the dial", but doesn't represent a real danger of resource depletion;
  2. The "Global Warming" scenario: needing to plan for (relatively) stable growth; and
  3. The "Google is sending me an unsolicited copy of The Index" scenario: this will deplete all my resources in relatively short order unless I do something to stop it.

It's the last one that's (I think) most interesting, and challenging, from a sysadmin's point of view.
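
For concreteness, a crude way to separate the three cases is to compare trend estimates at two timescales. Everything in this sketch (one sample per second, the window lengths, the factor of 3, eps) is an illustrative assumption:

    def classify(free_space, short_w=60, long_w=3600, eps=1.0):
        """Crude discriminator for the three scenarios, given free-space
        samples taken once per second (oldest first). Compares the fitted
        slope over a short window with the slope over a long window.
        """
        def slope(window):
            # Ordinary least-squares slope against sample index.
            n = len(window)
            xbar = (n - 1) / 2.0
            ybar = sum(window) / n
            num = sum((i - xbar) * (y - ybar) for i, y in enumerate(window))
            den = sum((i - xbar) ** 2 for i in range(n))
            return num / den             # change in free space per sample

        s_short = slope(free_space[-short_w:])
        s_long = slope(free_space[-long_w:])

        if s_long > -eps:                # long-term baseline roughly flat
            return "spike" if s_short < -eps else "normal"   # case 1
        if s_short < 3 * s_long:         # short slope far steeper downwards
            return "runaway drain"       # case 3: act now
        return "steady decline"          # case 2: plan for growth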

Comments (1)

避讳 2024-07-17 14:01:31

If it is actually related to a queue of work, then queueing theory may be the best route to an answer.
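
For instance, the standard M/M/1 result: with arrival rate λ and service rate μ, utilisation is ρ = λ/μ and the expected backlog is ρ/(1 − ρ), which blows up as ρ approaches 1. So an estimate of ρ, derived from your own measurements of how fast work arrives and is removed, is itself an early-warning signal. A trivial sketch:

    def mm1_expected_backlog(arrival_rate, service_rate):
        """Expected number of items in the system for an M/M/1 queue
        (standard formula). Returns None when arrivals meet or exceed
        service capacity, i.e. the backlog grows without bound; that is
        exactly the case worth alarming on.
        """
        rho = arrival_rate / service_rate    # utilisation
        if rho >= 1.0:
            return None                      # unstable queue
        return rho / (1.0 - rho)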

For the general case you could perhaps attempt a (multiple?) linear regression on the historical data, to detect whether there is a statistically significant rising trend in resource usage that is likely to lead to problems if it continues. This technique also lets you predict how long the trend can continue before it does lead to problems: just set a threshold for 'problem' and use the slope of the trend to estimate how long it will take to get there. You would have to play around with this, and with the variables you collect, to see whether there is any statistically significant relationship to discover in the first place.
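
A minimal sketch of that idea, assuming equally spaced samples of free space held in a list; scipy's linregress supplies the slope and its p-value, and the significance cutoff and 'problem' threshold here are illustrative:

    from scipy.stats import linregress

    def time_to_threshold(free_space, interval_s, threshold=0.0, p_max=0.01):
        """Fit a line to equally spaced free-space samples (oldest first).

        If there is a statistically significant downward trend, return the
        estimated number of seconds until free space crosses `threshold`;
        otherwise return None.
        """
        times = [i * interval_s for i in range(len(free_space))]
        fit = linregress(times, free_space)
        # A negative slope with a small p-value counts as significant.
        if fit.slope >= 0 or fit.pvalue > p_max:
            return None
        return (threshold - free_space[-1]) / fit.slope  # seconds from now

You would then alarm when the returned horizon drops below however long it takes a human to react.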

Although it covers a completely different topic (global warming), I've found tamino's blog (tamino.wordpress.com) to be a very good resource on statistical analysis of data that is full of knowns and unknowns. For example, see this post.

Edit: as per my comment, I think the problem is somewhat analogous to the GW problem. You have short-term bursts of activity which average out to zero, with long-term trends superimposed that you are interested in; there is probably also more than one long-term trend, and it changes from time to time. Tamino describes a technique which may be suitable for this, but unfortunately I cannot find the post I'm thinking of. It involves sliding regressions along the data (imagine multiple lines fitted to noisy data), and letting the data pick the inflection points. If you could do this then you could perhaps identify a significant change in the trend. Unfortunately it may only be identifiable after the fact, as you may need to accumulate a lot of data before the change becomes significant. But it might still be in time to head off resource depletion, and at the very least it may give you a robust way to determine what kind of safety margin and reserve resources you need in future.
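
A rough illustration of the sliding-regression idea (my own sketch, not Tamino's actual method): fit lines in two adjacent windows as they slide along the series, and flag points where the slopes differ sharply, since those are candidate inflection points in the trend. The window size and the slope-comparison test are assumptions to tune:

    from scipy.stats import linregress

    def trend_breaks(samples, window=100, z_crit=3.0):
        """Slide two adjacent regression windows along the series and
        report indices where the fitted slopes differ significantly:
        crude candidate change-points in the long-term trend.
        """
        breaks = []
        xs = list(range(window))
        for i in range(window, len(samples) - window + 1):
            left = linregress(xs, samples[i - window:i])
            right = linregress(xs, samples[i:i + window])
            # Compare the two slopes using their standard errors.
            se = (left.stderr ** 2 + right.stderr ** 2) ** 0.5
            if se > 0 and abs(right.slope - left.slope) / se > z_crit:
                breaks.append(i)
        return breaks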
