What is a good algorithm for detecting anomalies?

Posted on 2024-09-24 19:33:52

Background

Here is the problem:

  1. A black box outputs a new number each day.
  2. Those numbers have been recorded for a period of time.
  3. Detect when a new number from the black box falls outside the pattern of numbers established over the time period.

The numbers are integers, and the time period is a year.

Question

What algorithm will identify a pattern in the numbers?

The pattern might be simple, like always ascending or always descending, or the numbers might fall within a narrow range, and so forth.

Ideas

I have some ideas, but am uncertain as to the best approach, or what solutions already exist:

  • Machine learning algorithms?
  • Neural network?
  • Classify normal and abnormal numbers?
  • Statistical analysis?

Comments (3)

灯下孤影 2024-10-01 19:33:52

Cluster your data.

If you don't know how many modes your data will have, use something like a Gaussian Mixture Model (GMM) along with a scoring function (e.g., the Bayesian Information Criterion (BIC)) so you can automatically detect the likely number of clusters in your data. I recommend this instead of k-means if you have no idea what value k is likely to be. Once you've constructed a GMM for your data for the past year, given a new data point x, you can calculate the probability that it was generated by any one of the clusters (each modeled by a Gaussian in the GMM). If your new data point has a low probability of being generated by any of your clusters, it is very likely a true outlier.

If this sounds a little too involved, you will be happy to know that the entire GMM + BIC procedure for automatic cluster identification has been implemented for you in the excellent MCLUST package for R. I have used it several times to great success for such problems.

Not only will it allow you to identify outliers, it will also let you put a p-value on a point being an outlier, should you need (or want) that capability at some point.
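For readers who are not using R, the GMM + BIC idea described above can be sketched in plain Python. This is only a minimal illustration for one-dimensional data, not a substitute for MCLUST: the function names, the quantile-based initialisation, the variance floor `min_var`, the synthetic `history`, and the 1st-percentile threshold are all assumptions made for the example.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_log_density(x, weights, means, variances):
    """Log of the mixture density at x (floored to avoid log(0))."""
    total = sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))
    return math.log(max(total, 1e-300))


def fit_gmm_1d(data, k, iters=200, min_var=0.25):
    """Fit a k-component 1-D Gaussian mixture with plain EM."""
    n = len(data)
    ordered = sorted(data)
    # Assumed initialisation: start the means at evenly spaced quantiles.
    means = [float(ordered[int((j + 0.5) * n / k)]) for j in range(k)]
    grand_mean = sum(data) / n
    start_var = max(sum((x - grand_mean) ** 2 for x in data) / n, min_var)
    variances = [start_var] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = max(sum(p), 1e-300)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-9:
                continue  # component has effectively died; leave it unchanged
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)  # floor keeps integer data from collapsing
    log_lik = sum(mixture_log_density(x, weights, means, variances) for x in data)
    return weights, means, variances, log_lik


def select_by_bic(data, max_k=4):
    """Try k = 1..max_k and keep the model with the lowest BIC."""
    n = len(data)
    best = None
    for k in range(1, max_k + 1):
        weights, means, variances, log_lik = fit_gmm_1d(data, k)
        n_params = 3 * k - 1  # k means, k variances, k - 1 free weights
        bic = n_params * math.log(n) - 2.0 * log_lik
        if best is None or bic < best[0]:
            best = (bic, weights, means, variances)
    return best[1:]


# Synthetic stand-in for a year of daily integers with two modes (made up).
rng = random.Random(42)
history = [round(rng.gauss(100, 3)) for _ in range(180)] + \
          [round(rng.gauss(200, 5)) for _ in range(185)]

model = select_by_bic(history)
scores = sorted(mixture_log_density(x, *model) for x in history)
threshold = scores[int(0.01 * len(scores))]  # assumed cut-off: 1st percentile


def is_anomaly(x):
    return mixture_log_density(x, *model) < threshold
```

With this setup a value such as 150, which sits between the two modes, scores far below the threshold, while 100 or 200 do not. The threshold is a tuning choice; by construction roughly 1% of the historical points would also be flagged.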

归属感 2024-10-01 19:33:52

You could try line-fitting prediction using linear regression and see how it goes; it would be fairly easy to implement in your language of choice.
After you have fitted a line to your data, you could calculate the standard deviation of the points around the line.
If the new point lies within the trend line ± the standard deviation, it should not be regarded as an abnormality.
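A minimal sketch of the trend-line check, assuming the day index is used as x and the recorded number as y. The function names, the sample `history`, and the multiplier `k` on the standard deviation are illustrative assumptions; `k=1.0` corresponds to the plain "± the standard deviation" band, and a larger value makes the check less strict.

```python
import math


def fit_line(ys):
    """Ordinary least squares of ys against day index 0..n-1.

    Returns (slope, intercept, residual standard deviation).
    """
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in enumerate(ys)]
    # n - 2 degrees of freedom because slope and intercept were estimated.
    sigma = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return slope, intercept, sigma


def is_anomaly(ys, new_value, k=3.0):
    """True when new_value is more than k residual std devs off the trend line."""
    slope, intercept, sigma = fit_line(ys)
    predicted = intercept + slope * len(ys)  # extrapolate one day ahead
    return abs(new_value - predicted) > k * sigma


# Made-up history: a rising trend with a small repeating wobble.
history = [10 + 2 * day + (day % 3 - 1) for day in range(30)]

print(is_anomaly(history, 70))  # close to the extrapolated trend
print(is_anomaly(history, 90))  # far above the extrapolated trend
```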

PCA is another technique that comes to mind when dealing with this type of data.

You could also look into unsupervised learning. This is a machine learning technique that can be used to detect differences in larger data sets.

Sounds like a fun problem! Good luck

-黛色若梦 2024-10-01 19:33:52

There is little magic in any of the techniques you mention. I believe you should first try to narrow down the typical abnormalities you may encounter; it helps keep things simple.

Then, you may want to compute derived quantities relevant to those features. For instance: "I want to detect numbers that abruptly change direction" => compute u_{n+1} - u_n, and expect it to have a constant sign, or to fall in some range. You may want to keep this flexible, and allow your code design to be extensible (the Strategy pattern may be worth looking at if you do OOP).
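As a rough sketch of this idea, each check below is a plain function over the first differences u_{n+1} - u_n, and the checks are kept in a dictionary so that new ones can be plugged in (a lightweight stand-in for the Strategy pattern). All names here are hypothetical, and the two rules are just the ones mentioned above.

```python
from typing import Callable, Dict, List, Sequence

# A check returns True when the new value looks abnormal given the history.
Check = Callable[[Sequence[int], int], bool]


def differences(values: Sequence[int]) -> List[int]:
    """First differences u_{n+1} - u_n."""
    return [b - a for a, b in zip(values, values[1:])]


def breaks_constant_sign(history: Sequence[int], new_value: int) -> bool:
    diffs = differences(history)
    step = new_value - history[-1]
    if all(d > 0 for d in diffs):   # always ascending so far
        return step <= 0
    if all(d < 0 for d in diffs):   # always descending so far
        return step >= 0
    return False  # no consistent direction in the history, rule does not apply


def step_out_of_range(history: Sequence[int], new_value: int) -> bool:
    diffs = differences(history)
    step = new_value - history[-1]
    return not (min(diffs) <= step <= max(diffs))


CHECKS: Dict[str, Check] = {
    "constant_sign": breaks_constant_sign,
    "step_range": step_out_of_range,
}


def failed_checks(history: Sequence[int], new_value: int,
                  checks: Dict[str, Check] = CHECKS) -> List[str]:
    """Names of all checks that consider new_value abnormal."""
    return [name for name, check in checks.items() if check(history, new_value)]


history = [1, 3, 4, 7, 9]           # differences: 2, 1, 3, 2
print(failed_checks(history, 11))   # → []
print(failed_checks(history, 8))    # → ['constant_sign', 'step_range']
print(failed_checks(history, 20))   # → ['step_range']
```

Adding another derived quantity then only means writing one more function and registering it in `CHECKS`.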

Then, when you have some derived quantities of interest, you do statistical analysis on them. For instance, for a derived quantity A, you assume it should have some distribution P(a, b) (uniform([a, b]), or Beta(a, b), possibly something more complex), you put prior distributions on a and b, and you adjust them based on successive information. Then, the posterior likelihood of the information provided by the last point added should give you some insight into whether it is normal or not. The relative entropy between the posterior and prior distributions at each step is a good thing to monitor too. Consult a book on Bayesian methods for more info.
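To make the Bayesian step concrete, here is a small sketch that swaps in a simpler model than the uniform or Beta examples above: the derived quantity is assumed to be Normal with an unknown mean and a known noise variance, with a Normal prior on the mean. That choice is an assumption made only because it gives closed-form updates; the class name, the prior values, and the sample differences are all made up.

```python
import math


class NormalMeanBelief:
    """Normal belief about the unknown mean of a derived quantity.

    Observations are assumed to be Normal(mean, noise_var) with noise_var known.
    """

    def __init__(self, mean, var, noise_var):
        self.mean = mean
        self.var = var
        self.noise_var = noise_var

    def predictive_log_density(self, a):
        """Log density of observing `a` next, under the current belief."""
        total_var = self.var + self.noise_var
        return (-0.5 * math.log(2.0 * math.pi * total_var)
                - (a - self.mean) ** 2 / (2.0 * total_var))

    def update(self, a):
        """Return (updated belief, relative entropy KL(updated || current))."""
        precision = 1.0 / self.var + 1.0 / self.noise_var
        new_var = 1.0 / precision
        new_mean = new_var * (self.mean / self.var + a / self.noise_var)
        kl = 0.5 * (math.log(self.var / new_var)
                    + (new_var + (new_mean - self.mean) ** 2) / self.var
                    - 1.0)
        return NormalMeanBelief(new_mean, new_var, self.noise_var), kl


# Vague prior on the mean daily difference, then learn from past differences.
belief = NormalMeanBelief(mean=0.0, var=100.0, noise_var=1.0)
for diff in [2, 2, 3, 1, 2, 2]:
    belief, _ = belief.update(diff)

# Compare a typical new difference with an unusual one.
lp_typical = belief.predictive_log_density(2)
lp_odd = belief.predictive_log_density(15)
_, kl_typical = belief.update(2)
_, kl_odd = belief.update(15)
```

A low predictive log density together with a large jump in relative entropy (as for the difference of 15 here) is the kind of signal worth flagging; where to put the cut-off is a judgment call that depends on the data.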

I see little point in complex traditional machine learning stuff (perceptron layers or SVMs, to cite only two) if you want to detect outliers. These methods work great when classifying data that is known to be reasonably clean.
