How to use outlier tests in R code
As part of my data analysis workflow, I want to test for outliers and then run my further calculations both with and without those outliers.
I've found the outliers package, which has various tests, but I'm not sure how best to use them in my workflow.
If you're worried about outliers, use a robust method instead of throwing them out. For example, use rlm (from the MASS package) instead of lm.
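A minimal sketch of that suggestion, assuming rlm here means the function of that name in the MASS package. The data, the seed, and the variable names are invented for illustration only.

```r
library(MASS)  # provides rlm(); assumed to be the rlm meant above

set.seed(42)
x <- 1:30
y <- 2 * x + 5 + rnorm(30, sd = 1)
y[30] <- 200  # contaminate one observation (it would otherwise be near 65)

fit_lm  <- lm(y ~ x)   # ordinary least squares, pulled towards the outlier
fit_rlm <- rlm(y ~ x)  # robust M-estimation, down-weights the outlier

# Compare the coefficients; the data were generated with a slope of 2
print(coef(fit_lm))
print(coef(fit_rlm))

# rlm keeps the final weight of each observation; a small weight marks
# a point that the robust fit treated as unusual
print(round(fit_rlm$w[28:30], 2))
```

Nothing is deleted here: the suspicious point stays in the data and simply has less influence on the fit.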
I agree with Dirk: it's hard. I would recommend first looking at why you might have outliers. An outlier is just a number that someone thinks is suspicious; it's not a concrete 'bad' value, and unless you can find a reason for it to be an outlier, you may have to live with the uncertainty.
One thing you didn't mention is what kind of outlier you're looking for. Are your data clustered around a mean, do they follow a particular distribution, or is there some relationship between your variables?
Here are some examples.
First, we'll create some data and then taint it with an outlier.
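The code that originally accompanied this step is not shown here, so the following is only a sketch of what it could look like. The data frame name testout, the column names, the sample size, and the planted value are all assumptions.

```r
set.seed(1)  # arbitrary seed so the example is reproducible

# Two predictors and a response, each drawn independently from a normal distribution
testout <- data.frame(
  X1 = rnorm(50, mean = 50, sd = 10),
  X2 = rnorm(50, mean = 5, sd = 1.5),
  Y  = rnorm(50, mean = 200, sd = 25)
)

# Taint row 10 with a response far outside the range of the other values
testout$Y[10] <- 530

summary(testout$Y)
```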
It's often most useful to examine the data graphically (your brain is much better at spotting outliers than maths is).
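One way to do the graphical check is a boxplot, sketched below on an invented vector with one planted outlier. The choice of a boxplot is an assumption; a scatter plot or histogram would serve the same purpose.

```r
set.seed(2)
y <- rnorm(50, mean = 200, sd = 25)
y[10] <- 530  # planted outlier

# By default a boxplot draws points more than 1.5 box lengths beyond the
# box as individual dots, so the planted value stands out immediately
boxplot(y, ylab = "Y")

# The same rule is available without drawing anything
flagged <- boxplot.stats(y)$out
print(flagged)
```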
Then you can use a test. If the test returns a cut-off value, or the value that might be an outlier, you can use ifelse to remove it.
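A sketch of the ifelse approach, assuming the test in question is outlier() from the outliers package, which returns the value furthest from the mean. If the package is not installed, the code falls back to a base R stand-in with the same intent. The vector and its values are invented.

```r
set.seed(3)
y <- rnorm(50, mean = 200, sd = 25)
y[10] <- 530  # planted outlier

# Value furthest from the mean: outliers::outlier() if the package is
# installed, otherwise a base R stand-in with the same intent
suspect <- if (requireNamespace("outliers", quietly = TRUE)) {
  outliers::outlier(y)
} else {
  y[which.max(abs(y - mean(y)))]
}

# Keep the original vector and build a copy with the suspect value set to NA
y_clean <- ifelse(y == suspect, NA, y)

# Run the downstream calculation both ways
c(with_outlier = mean(y), without_outlier = mean(y_clean, na.rm = TRUE))
```

Keeping both versions side by side makes it easy to run the later calculations with and without the suspect value, which is what the question asks for.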
Or, for more complicated examples, you can use stats to calculate critical cut-off values, here using the Lund test (see Lund, R. E. 1975, "Tables for An Approximate Test for Outliers in Linear Models", Technometrics, vol. 17, no. 4, pp. 473-476, and Prescott, P. 1975, "An Approximate Test for Outliers in Linear Models", Technometrics, vol. 17, no. 1, pp. 129-132).
Edit: I've just noticed an issue in my code. The Lund test produces a critical value that should be compared to the absolute value of the studentized residual (i.e. without sign).
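The answer's own code is not reproduced here. The sketch below shows one way the comparison could be set up, assuming the critical value takes the F-quantile form commonly attributed to Lund's approximation; check it against the cited tables before relying on it. The function name lund_crit, the data, and the significance level of 0.1 are illustrative assumptions.

```r
# Approximate critical value for the largest absolute studentized residual.
# Assumed form of Lund's approximation; verify against the cited tables.
#   alpha: significance level
#   n:     number of observations
#   q:     number of model coefficients, including the intercept
lund_crit <- function(alpha, n, q) {
  f <- qf(1 - alpha / n, df1 = 1, df2 = n - q - 1)
  sqrt((n - q) * f / (n - q - 1 + f))
}

set.seed(4)
dat <- data.frame(X1 = rnorm(50, 50, 10), X2 = rnorm(50, 5, 1.5))
dat$Y <- 100 + 2 * dat$X1 + rnorm(50, sd = 10)
dat$Y[10] <- dat$Y[10] + 150  # planted outlier

fit  <- lm(Y ~ X1 + X2, data = dat)
crit <- lund_crit(0.1, n = nrow(dat), q = length(coef(fit)))

# rstandard() gives internally studentized residuals; compare their
# absolute value, not the signed value, to the critical value
dat$Ynew <- ifelse(abs(rstandard(fit)) > crit, NA, dat$Y)

which(is.na(dat$Ynew))
```

Taking abs() is the point of the edit note: without it, a large negative residual would never exceed the (positive) critical value.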
"It's hard". Much of this is context-dependent, and you may have to embed this into your application.
Other than the outliers package, there is also the qcc package, since the quality control literature covers this area.
There are many other areas you could look at, e.g. the Robust Statistics Task View.
Try the outliers::scores function. I don't advise removing the so-called outliers, but knowing your extreme observations is good. You'll find more help with outlier detection here
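A small sketch of scoring observations rather than removing them, assuming z-scores are the quantity of interest. It calls scores() from the outliers package when available and otherwise computes the same z-scores in base R. The data are invented.

```r
set.seed(5)
x <- c(rnorm(30, mean = 10, sd = 2), 25)  # last value is deliberately extreme

# z-scores: outliers::scores() if installed, otherwise the same quantity
# computed directly in base R
z <- if (requireNamespace("outliers", quietly = TRUE)) {
  outliers::scores(x, type = "z")
} else {
  (x - mean(x)) / sd(x)
}

# List the three most extreme observations without removing anything
extreme <- order(abs(z), decreasing = TRUE)[1:3]
data.frame(index = extreme, value = x[extreme], z = round(z[extreme], 2))
```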