A simple algorithm for online outlier detection of a generic time series

Published 2024-09-12 16:36:52


I am working with a large amount of time series.
These time series are basically network measurements coming every 10 minutes, and some of them are periodic (i.e. the bandwidth), while some other aren't (i.e. the amount of routing traffic).

I would like a simple algorithm for doing an online "outlier detection". Basically, I want to keep in memory (or on disk) the whole historical data for each time series, and I want to detect any outlier in a live scenario (each time a new sample is captured).
What is the best way to achieve these results?

I'm currently using a moving average in order to remove some noise, but then what next? Simple things like standard deviation, MAD, etc. against the whole data set don't work well (I can't assume the time series are stationary), and I would like something more "accurate", ideally a black box like:

double outlier_detection(double* vector, double value);

where vector is the array of double containing the historical data, and the return value is the anomaly score for the new sample "value".
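A windowed robust z-score is one way to fill in such a black box: because the statistics come from a recent window rather than the whole history, it tolerates non-stationary series better than a global standard deviation or MAD. The sketch below is a minimal illustration, not a definitive implementation; it deviates from the prototype above by adding a length parameter `n`, and the window of 144 samples (one day at 10-minute intervals) is an assumed value to tune:

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Comparison helper for qsort. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of v[0..n-1], computed on a scratch copy; v is left untouched. */
static double median(const double *v, size_t n) {
    double *tmp = malloc(n * sizeof *tmp);
    memcpy(tmp, v, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_double);
    double m = (n % 2) ? tmp[n / 2] : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
    free(tmp);
    return m;
}

/* Robust z-score of `value` against the last `window` samples of
 * vector[0..n-1] (or all of them if n < window).  Uses the median and
 * MAD instead of mean/stddev so a few past outliers don't inflate the
 * scale; the constant 1.4826 makes the MAD consistent with the standard
 * deviation for Gaussian data. */
double outlier_detection(const double *vector, size_t n, double value) {
    size_t window = 144;               /* one day of 10-minute samples */
    size_t start = n > window ? n - window : 0;
    size_t len = n - start;
    double med = median(vector + start, len);

    double *dev = malloc(len * sizeof *dev);
    for (size_t i = 0; i < len; i++)
        dev[i] = fabs(vector[start + i] - med);
    double mad = median(dev, len);
    free(dev);

    if (mad == 0.0) return 0.0;        /* degenerate window: no spread */
    return fabs(value - med) / (1.4826 * mad);
}
```

The score is expressed in robust standard-deviation units; flagging samples scoring above roughly 3 to 5 is a common starting point, but both the window size and the threshold need tuning per series.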


Comments (2)

笑脸一如从前 2024-09-19 16:36:52


This is a big and complex subject, and the answer will depend on (a) how much effort you want to invest and (b) how effective you want your outlier detection to be. One possible approach is adaptive filtering, typically used in applications like noise-cancelling headphones. A filter constantly adapts to the input signal, effectively matching its coefficients to a hypothetical short-term model of the signal source and thereby reducing the mean-square error of the output. The output (the residual error) then stays at a low level except when an outlier arrives, which produces a spike that is easy to detect with a threshold. Read up on adaptive filtering, LMS filters, etc., if you're serious about this kind of technique.
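To make the adaptive-filtering idea concrete, here is a minimal LMS predictor sketch in C. The filter order (4 taps) and the step size `mu` are assumptions that would need tuning to the signal; the residual returned by `lms_step` is the quantity you would threshold:

```c
#define TAPS 4          /* filter order: predict from the last 4 samples */

typedef struct {
    double w[TAPS];     /* adaptive filter coefficients */
    double x[TAPS];     /* most recent inputs; x[0] is the newest */
    double mu;          /* LMS step size (assumed; tune per signal scale) */
} lms_t;

void lms_init(lms_t *f, double mu) {
    for (int i = 0; i < TAPS; i++) f->w[i] = f->x[i] = 0.0;
    f->mu = mu;
}

/* Feed one new sample; returns the residual (prediction error).
 * The filter first predicts the sample from the previous TAPS inputs,
 * then nudges the weights along the error gradient (the LMS update
 * w += mu * e * x).  On a well-modelled signal the residual stays
 * small; an outlier shows up as a spike you can threshold. */
double lms_step(lms_t *f, double sample) {
    double pred = 0.0;
    for (int i = 0; i < TAPS; i++) pred += f->w[i] * f->x[i];
    double e = sample - pred;
    for (int i = 0; i < TAPS; i++) f->w[i] += f->mu * e * f->x[i];
    /* shift the delay line */
    for (int i = TAPS - 1; i > 0; i--) f->x[i] = f->x[i - 1];
    f->x[0] = sample;
    return e;
}
```

For stability the step size must satisfy `mu < 2 / (TAPS * input_power)`; with the periodic series in the question, a larger filter order would let the model capture more of the cycle.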

卷耳 2024-09-19 16:36:52


I suggest the scheme below, which should be implementable in a day or so:

Training

  • Collect as many samples as you can hold in memory
  • Remove obvious outliers using the standard deviation for each attribute
  • Calculate and store the correlation matrix and also the mean of each attribute
  • Calculate and store the Mahalanobis distances of all your samples

Calculating "outlierness":

For the single sample of which you want to know its "outlierness":

  • Retrieve the means, correlation matrix and Mahalanobis distances from training
  • Calculate the Mahalanobis distance "d" for your sample
  • Return the percentile in which "d" falls (using the Mahalanobis distances from training)

That will be your outlier score: 100% is an extreme outlier.


PS. In calculating the Mahalanobis distance, use the correlation matrix, not the covariance matrix. This is more robust if the sample measurements vary in unit and number.
