R：使用分位数 0.05 和 0.95 对数据框中的每一列进行异常值清理

发布于 2024-10-21 06:53:48 字数 1242 浏览 8 评论 0原文

我是R新手。在将样本放入随机森林之前，我想进行一些离群值清理和从 0 到 1 的总体缩放。

g<-c(1000,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,10)

如果我从 0 - 1 进行简单的缩放，结果将是：

> round((g - min(g))/abs(max(g) - min(g)),1)

 [1] 1.0 0.1 0.0 0.1 0.0 0.0 0.0 0.1 0.1 0.1 0.0 0.1 0.0 0.1 0.0 0.1 0.0

所以我的想法是将每列大于 0.95 分位数的值替换为小于 0.95 分位数的下一个值 - 对于0.05 分位数。

因此，预缩放结果将是：

g<-c(**70**,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,**40**)

缩放后：

> round((g - min(g))/abs(max(g) - min(g)),1)

 [1] 1.0 0.7 0.3 0.7 0.3 0.0 0.3 0.7 1.0 0.7 0.0 1.0 0.3 0.7 0.3 1.0 0.0

我需要整个数据帧的公式，因此 R 中的功能实现应该类似于：

> apply(c, 2, function(x) x[x`<quantile(x, 0.95)]`<-max(x[x, ... max without the quantile(x, 0.95))

任何人都可以帮忙吗？

旁边说：如果存在直接完成这项工作的函数，请告诉我。我已经检查过 cut 和 cut2。 cut 由于不唯一的中断而失败； cut2 可以工作，但只返回字符串值或平均值，我需要一个从 0 - 1 的数字向量。

用于试用：

a<-c(100,6,5,6,5,4,5,6,7,6,4,7,5,6,5,7,1)

b<-c(1000,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,10)

c<-cbind(a,b)

c<-as.data.frame(c)

问候并感谢您的帮助，

Rainer

原文

I am a R-novice. I want to do some outlier cleaning and over-all-scaling from 0 to 1 before putting the sample into a random forest.

g<-c(1000,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,10)

If i do a simple scaling from 0 - 1 the result would be:

> round((g - min(g))/abs(max(g) - min(g)),1)

 [1] 1.0 0.1 0.0 0.1 0.0 0.0 0.0 0.1 0.1 0.1 0.0 0.1 0.0 0.1 0.0 0.1 0.0

So my idea is to replace the values of each column that are greater than the 0.95-quantile with the next value smaller than the 0.95-quantile - and the same for the 0.05-quantile.

So the pre-scaled result would be:

g<-c(**70**,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,**40**)

and scaled:

> round((g - min(g))/abs(max(g) - min(g)),1)

 [1] 1.0 0.7 0.3 0.7 0.3 0.0 0.3 0.7 1.0 0.7 0.0 1.0 0.3 0.7 0.3 1.0 0.0

I need this formula for a whole dataframe, so the functional implementation within R should be something like:

> apply(c, 2, function(x) x[x`<quantile(x, 0.95)]`<-max(x[x, ... max without the quantile(x, 0.95))

Can anyone help?

Spoken beside: if there exists a function that does this job directly, please let me know. I already checked out cut and cut2. cut fails because of not-unique breaks; cut2 would work, but only gives back string values or the mean value, and I need a numeric vector from 0 - 1.

for trial:

a<-c(100,6,5,6,5,4,5,6,7,6,4,7,5,6,5,7,1)

b<-c(1000,60,50,60,50,40,50,60,70,60,40,70,50,60,50,70,10)

c<-cbind(a,b)

c<-as.data.frame(c)

Regards and thanks for help,

Rainer

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

妖妓 2024-10-28 06:53:49

我想不出 R 中的函数可以执行此操作，但您可以自己定义一个小函数：

foo <- function(x)
{
    quant <- quantile(x,c(0.05,0.95))
    x[x < quant[1]] <- min(x[x >= quant[1]])
    x[x > quant[2]] <- max(x[x <= quant[2]])
    return(round((x - min(x))/abs(max(x) - min(x)),1))
}

然后将其应用于数据框中的每个变量：

sapply(c,foo)
       a   b
 [1,] 1.0 1.0
 [2,] 0.7 0.7
 [3,] 0.3 0.3
 [4,] 0.7 0.7
 [5,] 0.3 0.3
 [6,] 0.0 0.0
 [7,] 0.3 0.3
 [8,] 0.7 0.7
 [9,] 1.0 1.0
[10,] 0.7 0.7
[11,] 0.0 0.0
[12,] 1.0 1.0
[13,] 0.3 0.3
[14,] 0.7 0.7
[15,] 0.3 0.3
[16,] 1.0 1.0
[17,] 0.0 0.0

编辑：这个答案旨在解决编程问题问题。关于实际使用它，我完全同意哈德利的观点

I can't think of a function in R that does this, but you can define a small one yourself:

foo <- function(x)
{
    quant <- quantile(x,c(0.05,0.95))
    x[x < quant[1]] <- min(x[x >= quant[1]])
    x[x > quant[2]] <- max(x[x <= quant[2]])
    return(round((x - min(x))/abs(max(x) - min(x)),1))
}

Then sapply this to each variable in your dataframe:

sapply(c,foo)
       a   b
 [1,] 1.0 1.0
 [2,] 0.7 0.7
 [3,] 0.3 0.3
 [4,] 0.7 0.7
 [5,] 0.3 0.3
 [6,] 0.0 0.0
 [7,] 0.3 0.3
 [8,] 0.7 0.7
 [9,] 1.0 1.0
[10,] 0.7 0.7
[11,] 0.0 0.0
[12,] 1.0 1.0
[13,] 0.3 0.3
[14,] 0.7 0.7
[15,] 0.3 0.3
[16,] 1.0 1.0
[17,] 0.0 0.0

Edit: This answer was meant to solve the programming problem. In regard to actually using it I fully agree with Hadley

回复收藏 0 原文