R: Outlier cleaning for each column in a data frame using the 0.05 and 0.95 quantiles
I am an R novice. I want to do some outlier cleaning and overall scaling from 0 to 1 before putting the sample into a random forest.
g<-c(1000,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,10)
If I do a simple scaling from 0 to 1, the result would be:
> round((g - min(g))/abs(max(g) - min(g)),1)
[1] 1.0 0.1 0.0 0.1 0.0 0.0 0.0 0.1 0.1 0.1 0.0 0.1 0.0 0.1 0.0 0.1 0.0
So my idea is to replace the values in each column that are greater than the 0.95 quantile with the largest value below that quantile, and likewise to replace values below the 0.05 quantile with the smallest value above it.
So the pre-scaled result would be:
g<-c(70,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,40)  # first and last values replaced
and scaled:
> round((g - min(g))/abs(max(g) - min(g)),1)
[1] 1.0 0.7 0.3 0.7 0.3 0.0 0.3 0.7 1.0 0.7 0.0 1.0 0.3 0.7 0.3 1.0 0.0
I need this formula for a whole dataframe, so the functional implementation within R should be something like:
> apply(c, 2, function(x) x[x<quantile(x, 0.95)]<-max(x[x, ... max without the quantile(x, 0.95))
Can anyone help?
As an aside: if there is a function that does this job directly, please let me know. I already checked out cut and cut2. cut fails because the breaks are not unique; cut2 would work, but it only gives back string values or the mean value, and I need a numeric vector from 0 to 1.
For testing:
a<-c(100,6,5,6,5,4,5,6,7,6,4,7,5,6,5,7,1)
b<-c(1000,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,10)
c<-cbind(a,b)
c<-as.data.frame(c)
Regards and thanks for help,
Rainer
Comments (2)
I can't think of a function in R that does this, but you can define a small one yourself:
Then sapply this to each variable in your data frame:
Edit: This answer was meant to solve the programming problem. As for actually using it, I fully agree with Hadley.
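The function this answer refers to is not shown here. What follows is only a minimal sketch of what such a helper could look like, not the answerer's original code. The names replace_outliers and rescale01 are hypothetical, and the sketch assumes R's default quantile definition (type 7) and numeric columns that are not constant.

```r
# Hypothetical helper: pull values outside the [lower, upper] quantiles
# back to the nearest observed value that lies inside that range.
replace_outliers <- function(x, lower = 0.05, upper = 0.95) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  inside <- x[!is.na(x) & x >= q[1] & x <= q[2]]
  if (length(inside) == 0) return(x)  # nothing to anchor to; leave x unchanged
  x[!is.na(x) & x < q[1]] <- min(inside)
  x[!is.na(x) & x > q[2]] <- max(inside)
  x
}

# Hypothetical helper: min-max scaling to [0, 1].
# Assumes the column is not constant (otherwise this divides by zero).
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Trial data from the question
a <- c(100,6,5,6,5,4,5,6,7,6,4,7,5,6,5,7,1)
b <- c(1000,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,10)
df <- data.frame(a, b)

# Apply both steps to every column; sapply returns a numeric matrix
scaled <- sapply(df, function(col) rescale01(replace_outliers(col)))
round(scaled[, "b"], 1)
# 1.0 0.7 0.3 0.7 0.3 0.0 0.3 0.7 1.0 0.7 0.0 1.0 0.3 0.7 0.3 1.0 0.0
```

For column b the 0.05 and 0.95 quantiles are 34 and 256, so 10 becomes 40 and 1000 becomes 70, which matches the pre-scaled vector in the question.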
Please don't do this. This is not a good strategy for dealing with outliers - particularly since it's unlikely that 10% of your data are outliers!
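To see where the 10% figure comes from, here is a small sketch using simulated data (an assumption for illustration, not data from the question). It counts how many values fall outside the 0.05 and 0.95 quantiles of a sample that contains no outliers at all.

```r
# Simulated, outlier-free sample (illustrative assumption)
set.seed(1)
x <- rnorm(1000)

# Fraction of values outside the 5% / 95% quantiles
q <- quantile(x, c(0.05, 0.95))
frac_changed <- mean(x < q[1] | x > q[2])
frac_changed
# 0.1
```

By construction the quantile rule always touches about a tenth of each column, whether or not those values are really outliers.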