Anomaly detection algorithms
I am tasked with detecting anomalies (known or unknown) using machine-learning algorithms in data of various formats, e.g. emails, IMs, etc.
What are your favorite and most effective anomaly detection algorithms?
What are their limitations and sweet spots?
How would you recommend those limitations be addressed?
All suggestions very much appreciated.
Comments (2)
Statistical filters, such as Bayesian filters or the bastardised versions employed by some spam filters, are easy to implement. Plus, there is a lot of online documentation about them.
The big downside is that such a filter cannot really detect unknown things. You train it with a large sample of known data so that it can categorize new incoming data. But you can turn the traditional spam filter upside down: train it to recognize legitimate data instead of illegitimate data, so that anything it doesn't recognize is an anomaly.
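As a rough sketch of the "upside down" filter idea, the snippet below trains a word-frequency model on legitimate messages only and treats anything with a low average word likelihood as an anomaly. The class name `LegitimateModel`, the add-one smoothing, the sample messages, and the threshold rule are all assumptions made for illustration; a real filter would need far more training data and a threshold tuned on held-out messages.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class LegitimateModel:
    """Unigram model with add-one smoothing, trained on legitimate text only."""

    def __init__(self, messages):
        self.counts = Counter(tok for m in messages for tok in tokenize(m))
        self.total = sum(self.counts.values())
        # One extra vocabulary slot stands in for every unseen word.
        self.vocab_size = len(self.counts) + 1

    def score(self, text):
        """Average log-probability per token; lower means less familiar."""
        tokens = tokenize(text)
        if not tokens:
            return float("-inf")
        log_prob = sum(
            math.log((self.counts[t] + 1) / (self.total + self.vocab_size))
            for t in tokens
        )
        return log_prob / len(tokens)

    def is_anomaly(self, text, threshold):
        return self.score(text) < threshold


# Hypothetical training set of known-legitimate messages.
legit = [
    "meeting moved to 3pm tomorrow",
    "please review the attached report before the meeting",
    "lunch tomorrow with the team",
    "the report is attached please review",
]
model = LegitimateModel(legit)

# Assumed rule of thumb: the worst training score minus a small margin.
threshold = min(model.score(m) for m in legit) - 0.5

print(model.is_anomaly("please review the report", threshold))
print(model.is_anomaly("wire bitcoin to offshore account immediately", threshold))
```

Because the model only knows what legitimate text looks like, it needs no examples of the anomalies themselves. The flip side is that any legitimate message with unusual vocabulary will also be flagged.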
There are various types of anomaly detection algorithms, depending on the type of data and the problem you are trying to solve:
Anomalies in time series signals:
A time series signal is anything you can draw as a line graph over time (e.g., CPU utilization, temperature, number of emails per minute, rate of visitors to a web page, etc.). Example algorithms are Holt-Winters, ARIMA models, Markov models, and more. I gave a talk on this subject a few months ago; it might give you more ideas about the algorithms and their limitations.
The video is at: https://www.youtube.com/watch?v=SrOM2z6h_RQ
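To make the forecast-and-compare idea behind these models concrete, here is a minimal sketch that uses simple exponential smoothing, a much reduced relative of Holt-Winters with no trend or seasonal component. It flags a point when its one-step-ahead forecast error is large relative to a running estimate of the error spread. The function name `detect_anomalies`, the parameter defaults, and the sample series are assumptions chosen for illustration.

```python
import math


def detect_anomalies(series, alpha=0.3, n_sigmas=3.0, warmup=5):
    """Return indices whose one-step-ahead forecast error is unusually large.

    The forecast is simple exponential smoothing; the error scale is an
    exponentially weighted estimate of the residual variance.
    """
    anomalies = []
    level = series[0]
    variance = 0.0
    for i in range(1, len(series)):
        residual = series[i] - level
        std = math.sqrt(variance)
        is_anomaly = i >= warmup and std > 0 and abs(residual) > n_sigmas * std
        if is_anomaly:
            anomalies.append(i)
        else:
            # Only points considered normal update the model, so a spike
            # neither drags the forecast nor inflates the variance.
            level += alpha * residual
            variance = (1 - alpha) * (variance + alpha * residual ** 2)
    return anomalies


# Hypothetical "emails per minute" readings with one obvious spike.
emails_per_minute = [20, 22, 19, 21, 20, 22, 21, 20, 95, 22, 19, 21]
print(detect_anomalies(emails_per_minute))
```

One limitation of this sketch: since flagged points never update the model, a genuine, lasting shift in the signal's level would keep being reported as anomalous. A signal with daily or weekly cycles would also need the seasonal terms that full Holt-Winters adds.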
Anomalies in tabular data:
These are cases where you have feature vectors that describe something (e.g., transforming an email into a feature vector that describes it: number of recipients, number of words, number of capitalized words, counts of keywords, etc.). Given a large set of such feature vectors, you want to detect the ones that are anomalous compared to the rest (sometimes called "outlier detection"). Almost any clustering algorithm is suitable in these cases, but which one is most suitable depends on the type of the features and their behavior: real-valued, ordinal, nominal, or anything else. The type of the features determines whether certain distance functions are suitable (a distance function is the basic requirement of most clustering algorithms), and some algorithms work better with certain types of features than others.
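The email example could look roughly like the sketch below. The keyword list and the function name `email_to_features` are hypothetical, and "capitalized words" is read here as words written entirely in upper case, which is only one possible interpretation.

```python
import re

# Hypothetical keyword list, chosen only for illustration.
KEYWORDS = ("invoice", "password", "urgent")


def email_to_features(recipients, body, keywords=KEYWORDS):
    """Turn one email into a fixed-length list of numeric features."""
    words = re.findall(r"[A-Za-z']+", body)
    lowered = [w.lower() for w in words]
    features = [
        len(recipients),  # number of recipients
        len(words),       # number of words
        # Number of all-caps words (single letters such as "I" are skipped).
        sum(1 for w in words if len(w) > 1 and w.isupper()),
    ]
    # One count per keyword, matched case-insensitively.
    features.extend(lowered.count(k) for k in keywords)
    return features


vector = email_to_features(
    ["a@example.com", "b@example.com"],
    "URGENT: please reset your password. This is URGENT!",
)
print(vector)
```

Every email maps to a vector of the same length, which is what a clustering or outlier detection algorithm needs as input. Features on very different scales usually have to be normalized before a distance function gives meaningful results.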
The simplest algorithm to try is k-means clustering, where the anomalies are either the members of very small clusters or the vectors that are far from all cluster centers. A one-class SVM can also detect outliers, and has the flexibility of choosing different kernels (and, effectively, different distance functions). Another popular algorithm is DBSCAN.
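The "far from all cluster centers" criterion can be sketched in plain Python as follows. This is a bare-bones k-means that seeds the centers with the first k points, which is an assumption made to keep the example deterministic; real implementations use better initialization. The sample points and the "three times the median distance" rule are likewise illustrative only.

```python
import math
import statistics


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points seed the centers."""
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centers


def distance_to_nearest_center(point, centers):
    return min(math.dist(point, c) for c in centers)


# Hypothetical 2-D feature vectors: two tight groups and one stray point.
points = [
    (1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (0.9, 1.1), (8.1, 7.9),
    (7.9, 8.2), (1.1, 1.2), (8.2, 8.1), (4.5, 15.0),
]
centers = kmeans(points, k=2)
distances = [distance_to_nearest_center(p, centers) for p in points]

# Assumed rule: flag points more than 3x the median distance from a center.
cutoff = 3 * statistics.median(distances)
outliers = [i for i, d in enumerate(distances) if d > cutoff]
print(outliers)
```

Note that the stray point still pulls its cluster's center toward itself, which is one reason k-means is only a starting point. Density-based methods such as DBSCAN do not have that particular weakness.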
(DISCLAIMER: I am the chief data scientist for Anodot, a commercial company doing real time anomaly detection for time series data).