Anomaly detection algorithms

Posted on 2024-10-05 21:42:26 | Word count: 184 | Views: 10 | Comments: 0

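My task is to detect anomalies (known or unknown) in data of various formats (e.g., emails, instant messages) using machine learning algorithms.

What are your favorite and most effective anomaly detection algorithms?
What are their limitations and sweet spots?
How would you recommend addressing those limitations?

All suggestions are very much appreciated.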

ぃ弥猫深巷。 2024-10-12 21:42:26

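Statistical filters such as Bayesian filters, or the hybrid variants used by some spam filters, are easy to implement. There is also plenty of documentation about them online.

The big downside is that such a filter cannot really detect unknown things. You train it on a large sample of known data so that it can classify new incoming data. But you can turn the traditional spam filter upside down: train it to recognize legitimate data instead of illegitimate data, so that anything it does not recognize is an anomaly.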
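To make the "upside down" filter concrete, here is a minimal sketch, not a production filter. It assumes a plain unigram model with add-one smoothing trained only on legitimate messages; the class name `LegitimateTextModel`, the whitespace tokenizer, and the way the threshold is chosen are all illustrative assumptions, not something taken from a particular spam filter.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is good enough for a sketch.
    return text.lower().split()


class LegitimateTextModel:
    """Unigram model trained only on legitimate messages (hypothetical helper)."""

    def __init__(self, messages, alpha=1.0):
        self.alpha = alpha  # additive smoothing constant
        self.counts = Counter(tok for m in messages for tok in tokenize(m))
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # one extra slot for unseen tokens

    def avg_log_prob(self, message):
        """Average per-token log probability; lower means less familiar."""
        tokens = tokenize(message)
        if not tokens:
            return 0.0
        denom = self.total + self.alpha * self.vocab_size
        log_prob = sum(
            math.log((self.counts[t] + self.alpha) / denom) for t in tokens
        )
        return log_prob / len(tokens)

    def is_anomaly(self, message, threshold):
        # Anything the model finds less familiar than the threshold is flagged.
        return self.avg_log_prob(message) < threshold


legit = [
    "meeting at noon tomorrow",
    "please review the attached report",
    "lunch at noon",
]
model = LegitimateTextModel(legit)
# Assumption: use the least familiar training message as the cutoff.
threshold = min(model.avg_log_prob(m) for m in legit)
```

With this cutoff, `model.is_anomaly("win free lottery prize", threshold)` is true because every token is unseen, while a message built from familiar words is not flagged. On real data you would more likely pick the threshold from a percentile of scores on held-out legitimate messages.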

总攻大人 2024-10-12 21:42:26

There are various types of anomaly detection algorithms, depending on the type of data and the problem you are trying to solve:

Anomalies in time series signals:
A time series signal is anything you can draw as a line graph over time (e.g., CPU utilization, temperature, number of emails per minute, rate of visitors to a web page). Example algorithms are Holt-Winters, ARIMA models, Markov models, and more. I gave a talk on this subject a few months ago; it might give you more ideas about algorithms and their limitations.
The video is at: https://www.youtube.com/watch?v=SrOM2z6h_RQ
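As a rough illustration of the forecast-and-compare idea behind these time series methods, here is a sketch that uses simple exponential smoothing, which is a much reduced relative of Holt-Winters (no trend or seasonality terms), not Holt-Winters itself. The function name `flag_spikes` and the parameters `alpha`, `k`, and `warmup` are assumptions chosen for the example.

```python
import math


def flag_spikes(series, alpha=0.3, k=3.0, warmup=5):
    """Return indices whose one-step forecast error exceeds k times the
    root-mean-square of the earlier, non-flagged forecast errors."""
    if not series:
        return []
    flagged = []
    level = series[0]   # smoothed level, used as the forecast for the next point
    sum_sq_err = 0.0    # running sum of squared forecast errors
    n_err = 0           # number of errors accumulated so far
    for i in range(1, len(series)):
        err = series[i] - level
        if n_err >= warmup:
            rms = math.sqrt(sum_sq_err / n_err)
            if abs(err) > k * rms:
                flagged.append(i)
                continue  # do not let the anomaly contaminate the model
        sum_sq_err += err * err
        n_err += 1
        level += alpha * err  # exponential smoothing update
    return flagged


emails_per_minute = [10, 11, 10, 12, 11, 10, 11, 12, 10, 11, 50, 11, 10, 12]
spikes = flag_spikes(emails_per_minute)
```

Here `spikes` contains only index 10, the jump to 50. A signal with daily or weekly seasonality would need the seasonal terms that this sketch leaves out, which is where full Holt-Winters or ARIMA models come in.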
Anomalies in tabular data:
These are cases where you have feature vectors that describe something (e.g., transforming an email into a feature vector that describes it: number of recipients, number of words, number of capitalized words, counts of keywords, etc.). Given a large set of such feature vectors, you want to detect the ones that are anomalous compared to the rest (sometimes called "outlier detection"). Almost any clustering algorithm is suitable in these cases, but which one is most suitable depends on the type of features and their behavior: real-valued, ordinal, nominal, or anything else. The type of features determines whether certain distance functions are suitable (the basic requirement for most clustering algorithms), and some algorithms handle certain types of features better than others.
The simplest algorithm to try is k-means clustering, where anomalous samples are either members of very small clusters or vectors that are far from all cluster centers. A one-class SVM can also detect outliers, and has the flexibility of choosing different kernels (and, effectively, different distance functions). Another popular algorithm is DBSCAN.
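The k-means idea can be sketched in a few lines of plain Python. This is only an illustration: the evenly strided initialization, the `factor` times median-distance rule, and the `min_cluster_size` cutoff are assumptions made for the example, and a real run would normally use a library implementation with k-means++ initialization or several restarts.

```python
import math
from statistics import median


def nearest(point, centers):
    """Return (index, distance) of the center closest to point."""
    dists = [math.dist(point, c) for c in centers]
    idx = min(range(len(centers)), key=dists.__getitem__)
    return idx, dists[idx]


def kmeans(points, k, iters=20):
    # Assumption: naive deterministic init that takes evenly strided points.
    n = len(points)
    centers = [points[i * n // k] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centers)[0]].append(p)
        new_centers = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # assignments stopped changing
        centers = new_centers
    return centers


def find_outliers(points, k, factor=3.0, min_cluster_size=2):
    """Flag points in tiny clusters or far from their nearest center."""
    centers = kmeans(points, k)
    assigned = [nearest(p, centers) for p in points]
    sizes = [0] * k
    for idx, _ in assigned:
        sizes[idx] += 1
    typical = median(d for _, d in assigned)
    return [
        i for i, (idx, d) in enumerate(assigned)
        if sizes[idx] < min_cluster_size or d > factor * typical
    ]


points = [
    (0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5),            # first group
    (10, 10), (11, 10), (10, 11), (11, 11), (10.5, 10.5),  # second group
    (5, 20),                                               # stray point
]
outliers = find_outliers(points, k=2)
```

On this toy data `outliers` is `[10]`, the stray point. Note that the Euclidean distance used here only makes sense for real-valued features on comparable scales; ordinal or nominal features would need a different distance function, as discussed above.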
When anomalies are known, the problem becomes a supervised learning problem, so you can use classification algorithms and train them on the known anomaly examples. However, as mentioned, this only detects those known anomalies, and if the number of training samples for anomalies is very small, the trained classifiers may not be accurate. Also, because the number of anomalies is typically very small compared to the number of "non-anomalies", when training the classifiers you might want to use techniques like boosting or bagging, with oversampling of the anomaly class(es), while optimizing for a very small false positive rate. There are various techniques for this in the literature. One idea that I have often found to work very well is the one Viola and Jones used for face detection: a cascade of classifiers. See: http://www.vision.caltech.edu/html-files/EE148-2005-Spring/pprs/viola04ijcv.pdf
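A cascade can be sketched as a chain of stages, ordered from cheapest to most expensive, where a sample is reported as an anomaly only if every stage accepts it. In the sketch below the stages are hand-written threshold rules standing in for trained classifiers, and the feature names (`recipients`, `caps_ratio`, `keyword_hits`) and thresholds are hypothetical.

```python
import math


def make_cascade(stages):
    """stages: list of (score_fn, threshold) pairs, cheapest first."""
    def is_anomaly(sample):
        for score_fn, threshold in stages:
            if score_fn(sample) < threshold:
                return False  # rejected early as normal, later stages never run
        return True
    return is_anomaly


def cascade_rates(stage_rates):
    """Overall (detection, false_positive) rates for a list of per-stage
    (detection, false_positive) pairs, assuming the rates simply multiply."""
    detection = math.prod(d for d, _ in stage_rates)
    false_positive = math.prod(f for _, f in stage_rates)
    return detection, false_positive


# Stand-ins for trained classifiers: each returns a score for one email.
is_anomaly = make_cascade([
    (lambda e: e["recipients"], 20),
    (lambda e: e["caps_ratio"], 0.3),
    (lambda e: e["keyword_hits"], 2),
])

normal = {"recipients": 2, "caps_ratio": 0.05, "keyword_hits": 0}
suspicious = {"recipients": 50, "caps_ratio": 0.6, "keyword_hits": 5}

overall_detection, overall_fp = cascade_rates([(0.99, 0.3)] * 3)
```

The appeal is in the arithmetic: three stages that each keep 99% of true anomalies but let through 30% of normal samples give roughly 97% detection at about a 2.7% false positive rate overall, and most normal samples are discarded by the first, cheapest stage.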

(Disclaimer: I am the chief data scientist at Anodot, a commercial company doing real-time anomaly detection for time series data.)
