Quickly remove zero-variance variables from a data.frame
I have a large data.frame that was generated by a process outside my control, which may or may not contain variables with zero variance (i.e. all the observations are the same). I would like to build a predictive model based on this data, and obviously these variables are of no use.
Here's the function I'm currently using to remove such variables from the data.frame. It's currently based on apply, and I was wondering if there are any obvious ways to speed this function up, so that it works quickly on very large datasets with a large number (400 or 500) of variables?
set.seed(1)
dat <- data.frame(
  A=factor(rep("X",10),levels=c('X','Y')),
  B=round(runif(10)*10),
  C=rep(10,10),
  D=c(rep(10,9),1),
  E=factor(rep("A",10)),
  F=factor(rep(c("I","J"),5)),
  G=c(rep(10,9),NA)
)
zeroVar <- function(data, useNA = 'ifany') {
  out <- apply(data, 2, function(x) {length(table(x, useNA = useNA))})
  which(out==1)
}
And here's the result of the process:
> dat
   A B  C  D E F  G
1  X 3 10 10 A I 10
2  X 4 10 10 A J 10
3  X 6 10 10 A I 10
4  X 9 10 10 A J 10
5  X 2 10 10 A I 10
6  X 9 10 10 A J 10
7  X 9 10 10 A I 10
8  X 7 10 10 A J 10
9  X 6 10 10 A I 10
10 X 1 10  1 A J NA
> dat[,-zeroVar(dat)]
   B  D F  G
1  3 10 I 10
2  4 10 J 10
3  6 10 I 10
4  9 10 J 10
5  2 10 I 10
6  9 10 J 10
7  9 10 I 10
8  7 10 J 10
9  6 10 I 10
10 1  1 J NA
> dat[,-zeroVar(dat, useNA = 'no')]
   B  D F
1  3 10 I
2  4 10 J
3  6 10 I
4  9 10 J
5  2 10 I
6  9 10 J
7  9 10 I
8  7 10 J
9  6 10 I
10 1  1 J
Comments (9)
You may also want to look into the nearZeroVar() function in the caret package. If you have one event out of 1000, it might be a good idea to discard these data (but this depends on the model). nearZeroVar() can do that.
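As a rough illustration of that suggestion, the sketch below calls nearZeroVar() on a small made-up data frame. The data frame `toy` and its column names are hypothetical, the `freqCut` and `uniqueCut` values are simply the package defaults written out, and the code assumes the caret package is installed.

```r
# Sketch only: assumes the caret package is installed.
library(caret)

# Hypothetical toy data: one constant column, one near-constant, one varied.
toy <- data.frame(
  const = rep(10, 1000),
  rare  = c(rep(0, 999), 1),   # one event out of 1000
  mixed = rep(1:10, 100)
)

# saveMetrics = TRUE returns one row per column, including the logical
# flags zeroVar and nzv (near-zero variance).
metrics <- nearZeroVar(toy, freqCut = 95/5, uniqueCut = 10, saveMetrics = TRUE)
print(metrics)

# Drop every column flagged as near-zero variance
# (zero-variance columns are flagged as well).
toy_clean <- toy[, !metrics$nzv, drop = FALSE]
```

Whether dropping the near-constant column is appropriate depends on the model, as noted above; the `zeroVar` flag alone can be used for the stricter case.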
Don't use table(): it is very slow for this kind of thing. One option is length(unique(x)), which is an order of magnitude faster than yours on the example data set whilst giving similar output. Simon's solution here is similarly quick on this example, but you'll have to see whether they scale similarly to real problem sizes.
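A minimal sketch of the length(unique(x)) idea follows. The function name `zeroVarUnique` and its `na.rm` argument are assumptions made for illustration, not the answer's original code, and the data frame is a cut-down version of the one in the question.

```r
# Sketch: flag columns whose values are all identical, without table().
# zeroVarUnique and its na.rm argument are hypothetical names.
zeroVarUnique <- function(data, na.rm = FALSE) {
  n_unique <- vapply(data, function(x) {
    if (na.rm) x <- x[!is.na(x)]
    length(unique(x))
  }, integer(1))
  which(n_unique <= 1)
}

dat <- data.frame(
  A = factor(rep("X", 10), levels = c("X", "Y")),
  C = rep(10, 10),
  D = c(rep(10, 9), 1),
  G = c(rep(10, 9), NA)
)

zeroVarUnique(dat)                # A and C; here NA counts as a distinct value
zeroVarUnique(dat, na.rm = TRUE)  # A, C and G

# Drop the flagged columns; setdiff() avoids the empty-index pitfall of
# dat[, -idx], which returns no columns when idx is empty.
idx <- zeroVarUnique(dat)
dat_clean <- dat[, setdiff(seq_along(dat), idx), drop = FALSE]
```

Iterating over the data frame's columns with vapply() also avoids the conversion to a character matrix that apply() performs.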
Simply don't use table: it's extremely slow on numeric vectors since it converts them to strings. I would probably use a check that is TRUE for 0-variance, NA for columns with NAs and FALSE for non-zero variance.
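One way to get that TRUE / NA / FALSE behaviour is to compare var() with zero for each column. The sketch below is an assumption about what such a check could look like, not the answer's original code; `isZeroVar` is a hypothetical name, and the columns are assumed to be numeric or factor.

```r
# Sketch: per-column zero-variance test via var().
# isZeroVar is a hypothetical helper name.
isZeroVar <- function(data) {
  vapply(data, function(x) {
    # var() does not accept factors, so use their integer codes instead.
    if (is.factor(x)) x <- as.integer(x)
    var(x) == 0
  }, logical(1))
}

dat <- data.frame(
  C = rep(10, 10),
  D = c(rep(10, 9), 1),
  E = factor(rep("A", 10)),
  G = c(rep(10, 9), NA)
)

flags <- isZeroVar(dat)   # C: TRUE, D: FALSE, E: TRUE, G: NA

# Keep every column not known to be constant (NA is treated as "keep" here).
kept <- dat[, !(flags %in% TRUE), drop = FALSE]
```

How the NA case should be treated is a modelling decision; passing `na.rm = TRUE` to var() would instead test the non-missing values only.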
Use the caret package and its nearZeroVar function.
Because I'm an idiot who keeps googling the same question, let me leave a tidyverse approach that I've settled on. I think this could be made shorter but I'm too tired!
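For readers looking for a tidyverse version, here is one possible sketch. It is not the poster's code; it assumes dplyr 1.0.0 or later (for `where()`), and the small data frame is made up for the example.

```r
# Sketch only: assumes dplyr >= 1.0.0 is installed.
library(dplyr)

# Hypothetical example data.
dat <- data.frame(
  B = c(3, 4, 6, 9, 2),
  C = rep(10, 5),
  E = factor(rep("A", 5)),
  F = factor(c("I", "J", "I", "J", "I"))
)

# Keep only the columns that have more than one distinct value.
dat_clean <- dat %>% select(where(~ n_distinct(.x) > 1))
```

n_distinct() counts NA as a value by default; its `na.rm` argument changes that.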
Well, save yourself some coding time. To avoid nasty floating-point roundoffs, take the resulting output vector, which I'll call "bar," and do something like bar[bar < 2*.Machine$double.eps] <- 0, and then finally your data frame dat[,as.logical(bar)] should do the trick.
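The thresholding step can be sketched as follows. How the output vector is produced is an assumption here: the sketch takes `bar` to be the per-column variance of an all-numeric data frame.

```r
# Sketch: numeric columns only; 'bar' follows the naming used in the text.
dat <- data.frame(
  B = c(3, 4, 6, 9, 2, 9, 9, 7, 6, 1),
  C = rep(10, 10),
  D = c(rep(10, 9), 1)
)

# Assumption: the output vector is the variance of each column.
bar <- vapply(dat, var, numeric(1))

# Treat variances below a tiny tolerance as exactly zero.
bar[bar < 2 * .Machine$double.eps] <- 0

# as.logical() maps 0 to FALSE and any other number to TRUE,
# so only columns with non-zero variance are kept.
dat_clean <- dat[, as.logical(bar), drop = FALSE]
```

Columns containing NA give an NA variance, which would need separate handling before the logical subsetting.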
How about using factor to count the number of unique elements and looping with sapply? NAs are excluded by default, but this can be changed with the exclude parameter of factor.
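A small sketch of the factor-based count, with and without NA treated as a level. The helper name `nLevels` is hypothetical, and the data frame is a subset of the question's example.

```r
# Sketch: count distinct values per column through factor levels.
# nLevels is a hypothetical helper name.
nLevels <- function(x, ...) length(levels(factor(x, ...)))

dat <- data.frame(
  A = factor(rep("X", 10), levels = c("X", "Y")),
  D = c(rep(10, 9), 1),
  G = c(rep(10, 9), NA)
)

# Default: NA is excluded, so G counts as constant and is dropped.
keep_default <- sapply(dat, function(x) nLevels(x) > 1)
dat[keep_default]

# exclude = NULL makes NA a level of its own, so G is kept.
keep_with_na <- sapply(dat, function(x) nLevels(x, exclude = NULL) > 1)
dat[keep_with_na]
```

Calling factor() on an existing factor rebuilds the levels from the values present, which is why the unused level "Y" in column A does not count.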
I think having zero variance is equivalent to being constant, so one can get by without doing any arithmetic operations at all. I would expect that range() outperforms var(), but I have not verified this.
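The range() idea might look like the sketch below. `isConstant` is a hypothetical helper, and treating an all-NA column as constant is an assumption of this sketch; no timing claim is made.

```r
# Sketch: a column is constant when its minimum equals its maximum.
# isConstant is a hypothetical helper name.
isConstant <- function(x, na.rm = TRUE) {
  # range() is not defined for unordered factors, so compare integer codes.
  if (is.factor(x)) x <- as.integer(x)
  if (na.rm) x <- x[!is.na(x)]
  if (length(x) == 0) return(TRUE)  # nothing but NAs: treated as constant
  r <- range(x)
  # With na.rm = FALSE and missing values, r is NA and the result is FALSE.
  isTRUE(r[1] == r[2])
}

dat <- data.frame(
  C = rep(10, 10),
  D = c(rep(10, 9), 1),
  E = factor(rep("A", 10)),
  G = c(rep(10, 9), NA)
)

constant <- vapply(dat, isConstant, logical(1))
dat_clean <- dat[!constant]
```

Benchmarking this against a var()-based check on realistically sized data would be needed to confirm the expectation stated above.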
Check this custom function. I did not try it on data frames with 100+ variables.