Best way to statistically detect anomalies in data
Our webapp collects a huge amount of data about user actions, network activity, database load, etc.
All data is stored in warehouses and we have quite a lot of interesting views on this data.
If something odd happens, chances are it shows up somewhere in the data.
However, to manually detect if something out of the ordinary is going on, one has to continually look through this data, and look for oddities.
My question: what is the best way to detect changes in dynamic data that can be seen as 'out of the ordinary'?
Are Bayesian filters (I've seen these mentioned when reading about spam detection) the way to go?
Any pointers would be great!
EDIT:
To clarify: the data shows, for example, a daily curve of database load.
This curve typically looks similar to the curve from yesterday.
Over time this curve might change slowly.
It would be nice if a warning could go off when the day-to-day change in the curve falls outside some parameters.
R
Comments (4)
Take a look at control charts. They provide a way to track changes in your data visually and to specify when the data is "out of control" or "anomalous". They are heavily used in manufacturing to ensure quality control.
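To make the control chart idea concrete, here is a minimal sketch of a basic Shewhart-style chart: limits are set at the baseline mean plus or minus k standard deviations, and any point outside them is flagged. The function names, the sample load values, and the choice of k = 3 are all assumptions for illustration, not something taken from the question's actual data.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower, center, upper) limits computed from baseline samples."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation
    return center - k * spread, center, center + k * spread


def out_of_control(values, limits):
    """Return the indexes of values that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical database load readings for the same hour on earlier days.
baseline = [52.0, 48.0, 50.0, 51.0, 49.0, 50.0, 47.0, 53.0]
limits = control_limits(baseline)

# Hypothetical new readings to check against the limits.
today = [50.5, 49.0, 71.0, 51.0]
flagged = out_of_control(today, limits)
print(limits, flagged)
```

Since the daily curve may drift slowly, one option is to recompute the limits from a rolling window of recent days instead of a fixed baseline, so the chart follows gradual change but still flags sudden jumps.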
This question is impossible to answer without knowing much more about the particular data you have. For an overview of what kinds of approaches exist, see Anomaly Detection: A Survey by Chandola, Banerjee, and Kumar.
Bayesian classification might help you find some anomalies in your data, depending on the type of data and how well you train your Bayesian filter.
There is even one available as a web service at uClassify.com.
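As a rough illustration of what training such a filter could involve, here is a small Gaussian naive Bayes sketch in plain Python. It assumes you already have days labeled as normal or anomalous, which may not be the case in practice. The two features (peak load and queries per second), the labels, and every number below are hypothetical, and this is unrelated to how uClassify works internally.

```python
import math
from statistics import mean, pstdev


def fit(samples_by_label):
    """Estimate a prior and per-feature Gaussian parameters for each label."""
    total = sum(len(rows) for rows in samples_by_label.values())
    model = {}
    for label, rows in samples_by_label.items():
        columns = list(zip(*rows))  # one tuple per feature
        model[label] = {
            "prior": len(rows) / total,
            # Floor the standard deviation so a constant feature cannot divide by zero.
            "params": [(mean(c), max(pstdev(c), 1e-6)) for c in columns],
        }
    return model


def log_gaussian(x, mu, sigma):
    """Log density of a normal distribution with mean mu and std sigma at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def classify(model, row):
    """Return the label with the highest log posterior score for this row."""
    scores = {}
    for label, entry in model.items():
        score = math.log(entry["prior"])
        for x, (mu, sigma) in zip(row, entry["params"]):
            score += log_gaussian(x, mu, sigma)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical labeled days: (peak load, queries per second).
training = {
    "normal": [(50.0, 120.0), (52.0, 115.0), (48.0, 125.0), (51.0, 118.0)],
    "anomalous": [(90.0, 400.0), (85.0, 380.0), (95.0, 420.0)],
}
model = fit(training)
label_a = classify(model, (49.0, 121.0))
label_b = classify(model, (88.0, 390.0))
print(label_a, label_b)
```

The quality of the result depends heavily on how many labeled anomalies you have. With few or none, an unsupervised approach such as control limits is usually easier to get started with.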
This depends so much on what the data is. Take a statistics class and learn the basics first. This isn't usually an easy or simple problem.