A simple algorithm for online outlier detection of a generic time series

Published 2024-09-12 16:36:52


I am working with a large amount of time series.
These time series are basically network measurements coming every 10 minutes, and some of them are periodic (i.e. the bandwidth), while some other aren't (i.e. the amount of routing traffic).

I would like a simple algorithm for doing an online "outlier detection". Basically, I want to keep in memory (or on disk) the whole historical data for each time series, and I want to detect any outlier in a live scenario (each time a new sample is captured).
What is the best way to achieve these results?

I'm currently using a moving average in order to remove some noise, but then what next? Simple things like standard deviation, MAD, etc. against the whole data set don't work well (I can't assume the time series are stationary), and I would like something more "accurate", ideally a black box like:

double outlier_detection(double* vector, double value);

where vector is the array of double containing the historical data, and the return value is the anomaly score for the new sample "value".
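A windowed robust z-score is one way to fill in such a black box: because the statistics come from a recent window rather than the whole history, it tolerates non-stationary series better than a global standard deviation or MAD. The sketch below is a minimal illustration, not a definitive implementation; it deviates from the prototype above by adding a length parameter `n`, and the window of 144 samples (one day at 10-minute intervals) is an assumed value to tune:

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Comparison helper for qsort. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of v[0..n-1], computed on a scratch copy; v is left untouched. */
static double median(const double *v, size_t n) {
    double *tmp = malloc(n * sizeof *tmp);
    memcpy(tmp, v, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, cmp_double);
    double m = (n % 2) ? tmp[n / 2] : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
    free(tmp);
    return m;
}

/* Robust z-score of `value` against the last `window` samples of
 * vector[0..n-1] (or all of them if n < window).  Uses the median and
 * MAD instead of mean/stddev so a few past outliers don't inflate the
 * scale; the constant 1.4826 makes the MAD consistent with the standard
 * deviation for Gaussian data. */
double outlier_detection(const double *vector, size_t n, double value) {
    size_t window = 144;               /* one day of 10-minute samples */
    size_t start = n > window ? n - window : 0;
    size_t len = n - start;
    double med = median(vector + start, len);

    double *dev = malloc(len * sizeof *dev);
    for (size_t i = 0; i < len; i++)
        dev[i] = fabs(vector[start + i] - med);
    double mad = median(dev, len);
    free(dev);

    if (mad == 0.0) return 0.0;        /* degenerate window: no spread */
    return fabs(value - med) / (1.4826 * mad);
}
```

The score is expressed in robust standard-deviation units; flagging samples scoring above roughly 3 to 5 is a common starting point, but both the window size and the threshold need tuning per series.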


Comments (2)

笑脸一如从前 2024-09-19 16:36:52


This is a big and complex subject, and the answer will depend on (a) how much effort you want to invest and (b) how effective you want your outlier detection to be. One possible approach is adaptive filtering, typically used in applications like noise-cancelling headphones. A filter constantly adapts to the input signal, effectively matching its coefficients to a hypothetical short-term model of the signal source and thereby reducing the mean-square error of the output. The output (the residual error) then stays at a low level except when an outlier arrives, which produces a spike that is easy to detect with a threshold. Read up on adaptive filtering, LMS filters, etc., if you're serious about this kind of technique.
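To make the adaptive-filtering idea concrete, here is a minimal LMS predictor sketch in C. The filter order (4 taps) and the step size `mu` are assumptions that would need tuning to the signal; the residual returned by `lms_step` is the quantity you would threshold:

```c
#define TAPS 4          /* filter order: predict from the last 4 samples */

typedef struct {
    double w[TAPS];     /* adaptive filter coefficients */
    double x[TAPS];     /* most recent inputs; x[0] is the newest */
    double mu;          /* LMS step size (assumed; tune per signal scale) */
} lms_t;

void lms_init(lms_t *f, double mu) {
    for (int i = 0; i < TAPS; i++) f->w[i] = f->x[i] = 0.0;
    f->mu = mu;
}

/* Feed one new sample; returns the residual (prediction error).
 * The filter first predicts the sample from the previous TAPS inputs,
 * then nudges the weights along the error gradient (the LMS update
 * w += mu * e * x).  On a well-modelled signal the residual stays
 * small; an outlier shows up as a spike you can threshold. */
double lms_step(lms_t *f, double sample) {
    double pred = 0.0;
    for (int i = 0; i < TAPS; i++) pred += f->w[i] * f->x[i];
    double e = sample - pred;
    for (int i = 0; i < TAPS; i++) f->w[i] += f->mu * e * f->x[i];
    /* shift the delay line */
    for (int i = TAPS - 1; i > 0; i--) f->x[i] = f->x[i - 1];
    f->x[0] = sample;
    return e;
}
```

For stability the step size must satisfy `mu < 2 / (TAPS * input_power)`; with the periodic series in the question, a larger filter order would let the model capture more of the cycle.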

卷耳 2024-09-19 16:36:52


I suggest the scheme below, which should be implementable in a day or so:

Training

  • Collect as many samples as you can hold in memory
  • Remove obvious outliers using the standard deviation for each attribute
  • Calculate and store the correlation matrix and also the mean of each attribute
  • Calculate and store the Mahalanobis distances of all your samples

Calculating "outlierness":

For the single sample of which you want to know its "outlierness":

  • Retrieve the means, correlation matrix and Mahalanobis distances from training
  • Calculate the Mahalanobis distance "d" for your sample
  • Return the percentile in which "d" falls (using the Mahalanobis distances from training)

That will be your outlier score: 100% is an extreme outlier.


PS. In calculating the Mahalanobis distance, use the correlation matrix, not the covariance matrix. This is more robust if the sample measurements vary in unit and number.
