快速从 data.frame 中删除零方差变量

发布于 2024-12-25 16:55:27 字数 1218 浏览 1 评论 0原文

我有一个大型 data.frame,它是由我无法控制的过程生成的,它可能包含也可能不包含零方差的变量(即所有观察结果都是相同的)。我想根据这些数据建立一个预测模型,显然这些变量没有用。

这是我当前用来从 data.frame 中删除此类变量的函数。它目前基于 apply,我想知道是否有任何明显的方法可以加快此函数的速度,以便它可以在具有大量(400 或 500)变量的非常大的数据集上快速运行?

set.seed(1)
dat <- data.frame(
    A=factor(rep("X",10),levels=c('X','Y')),
    B=round(runif(10)*10),
    C=rep(10,10),
    D=c(rep(10,9),1),
    E=factor(rep("A",10)),
    F=factor(rep(c("I","J"),5)),
    G=c(rep(10,9),NA)
)
zeroVar <- function(data, useNA = 'ifany') {
    out <- apply(data, 2, function(x) {length(table(x, useNA = useNA))})
    which(out==1)
}

这是该过程的结果:

> dat
   A B  C  D E F  G
1  X 3 10 10 A I 10
2  X 4 10 10 A J 10
3  X 6 10 10 A I 10
4  X 9 10 10 A J 10
5  X 2 10 10 A I 10
6  X 9 10 10 A J 10
7  X 9 10 10 A I 10
8  X 7 10 10 A J 10
9  X 6 10 10 A I 10
10 X 1 10  1 A J NA

> dat[,-zeroVar(dat)]
   B  D F  G
1  3 10 I 10
2  4 10 J 10
3  6 10 I 10
4  9 10 J 10
5  2 10 I 10
6  9 10 J 10
7  9 10 I 10
8  7 10 J 10
9  6 10 I 10
10 1  1 J NA

> dat[,-zeroVar(dat, useNA = 'no')]
   B  D F
1  3 10 I
2  4 10 J
3  6 10 I
4  9 10 J
5  2 10 I
6  9 10 J
7  9 10 I
8  7 10 J
9  6 10 I
10 1  1 J

I have a large data.frame that was generated by a process outside my control, which may or may not contain variables with zero variance (i.e. all the observations are the same). I would like to build a predictive model based on this data, and obviously these variables are of no use.

Here's the function I'm currently using to remove such variables from the data.frame. It's currently based on apply, and I was wondering if there are any obvious ways to speed this function up, so that it works quickly on very large datasets, with a large number (400 or 500) of variables?

set.seed(1)
dat <- data.frame(
    A=factor(rep("X",10),levels=c('X','Y')),
    B=round(runif(10)*10),
    C=rep(10,10),
    D=c(rep(10,9),1),
    E=factor(rep("A",10)),
    F=factor(rep(c("I","J"),5)),
    G=c(rep(10,9),NA)
)
zeroVar <- function(data, useNA = 'ifany') {
    out <- apply(data, 2, function(x) {length(table(x, useNA = useNA))})
    which(out==1)
}

And here's the result of the process:

> dat
   A B  C  D E F  G
1  X 3 10 10 A I 10
2  X 4 10 10 A J 10
3  X 6 10 10 A I 10
4  X 9 10 10 A J 10
5  X 2 10 10 A I 10
6  X 9 10 10 A J 10
7  X 9 10 10 A I 10
8  X 7 10 10 A J 10
9  X 6 10 10 A I 10
10 X 1 10  1 A J NA

> dat[,-zeroVar(dat)]
   B  D F  G
1  3 10 I 10
2  4 10 J 10
3  6 10 I 10
4  9 10 J 10
5  2 10 I 10
6  9 10 J 10
7  9 10 I 10
8  7 10 J 10
9  6 10 I 10
10 1  1 J NA

> dat[,-zeroVar(dat, useNA = 'no')]
   B  D F
1  3 10 I
2  4 10 J
3  6 10 I
4  9 10 J
5  2 10 I
6  9 10 J
7  9 10 I
8  7 10 J
9  6 10 I
10 1  1 J

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(9

澉约 2025-01-01 16:55:27

您可能还想查看插入符包中的 nearZeroVar() 函数。

如果 1000 个事件中有一个,则丢弃这些数据可能是个好主意(但这取决于模型)。 nearZeroVar() 可以做到这一点。

You may also want to look into the nearZeroVar() function in the caret package.

If you have one event out of 1000, it might be a good idea to discard these data (but this depends on the model). nearZeroVar() can do that.

骷髅 2025-01-01 16:55:27

不要使用 table() - 对于此类事情来说非常慢。一个选项是 length(unique(x))

foo <- function(dat) {
    out <- lapply(dat, function(x) length(unique(x)))
    want <- which(!out > 1)
    unlist(want)
}

system.time(replicate(1000, zeroVar(dat)))
system.time(replicate(1000, foo(dat)))

在示例数据集上,它比您的速度快一个数量级,同时提供类似的输出:

> system.time(replicate(1000, zeroVar(dat)))
   user  system elapsed 
  3.334   0.000   3.335 
> system.time(replicate(1000, foo(dat)))
   user  system elapsed 
  0.324   0.000   0.324

Simon 的解决方案在此示例中同样快速:

> system.time(replicate(1000, which(!unlist(lapply(dat, 
+             function(x) 0 == var(if (is.factor(x)) as.integer(x) else x))))))
   user  system elapsed 
  0.392   0.000   0.395

但您会必须看看它们的规模是否与实际问题的规模相似。

Don't use table() - very slow for such things. One option is length(unique(x)):

foo <- function(dat) {
    out <- lapply(dat, function(x) length(unique(x)))
    want <- which(!out > 1)
    unlist(want)
}

system.time(replicate(1000, zeroVar(dat)))
system.time(replicate(1000, foo(dat)))

Which is an order magnitude faster than yours on the example data set whilst giving similar output:

> system.time(replicate(1000, zeroVar(dat)))
   user  system elapsed 
  3.334   0.000   3.335 
> system.time(replicate(1000, foo(dat)))
   user  system elapsed 
  0.324   0.000   0.324

Simon's solution here is similarly quick on this example:

> system.time(replicate(1000, which(!unlist(lapply(dat, 
+             function(x) 0 == var(if (is.factor(x)) as.integer(x) else x))))))
   user  system elapsed 
  0.392   0.000   0.395

but you'll have to see if they scale similarly to real problem sizes.

成熟稳重的好男人 2025-01-01 16:55:27

根本不要使用 table - 它在数字向量上非常慢,因为它将它们转换为字符串。我可能会使用类似的方法,

var0 <- unlist(lapply(df, function(x) 0 == var(if (is.factor(x)) as.integer(x) else x)))

对于 0 方差,为 TRUE;对于具有 NA 的列,使用 NA;对于非零方差,使用 FALSE

Simply don't use table - it's extremely slow on numeric vectors since it converts them to strings. I would probably use something like

var0 <- unlist(lapply(df, function(x) 0 == var(if (is.factor(x)) as.integer(x) else x)))

It will be TRUE for 0-variance, NA for columns with NAs and FALSE for non-zero variance

放血 2025-01-01 16:55:27

使用 Caret 包和函数 nearZeroVar

require(caret)
NZV<- nearZeroVar(dataset, saveMetrics = TRUE)
NZV[NZV[,"zeroVar"] > 0, ] 
NZV[NZV[,"zeroVar"] + NZV[,"nzv"] > 0, ]

Use the Caret Package and the function nearZeroVar

require(caret)
NZV<- nearZeroVar(dataset, saveMetrics = TRUE)
NZV[NZV[,"zeroVar"] > 0, ] 
NZV[NZV[,"zeroVar"] + NZV[,"nzv"] > 0, ]
浅忆流年 2025-01-01 16:55:27

因为我是一个一直在谷歌上搜索同一问题的白痴,所以让我留下一个我已经确定的 tidyverse 方法:

library(tidyverse)

df <- df %>%
  select(
    - {
      df %>%
        map_dbl(~ length(table(.x, useNA = "ifany"))) %>%
        {which(. == 1)} %>%
        names()
    }
  )

我认为这可以缩短,但我太累了!

Because I'm an idiot who keeps googling the same question, let me leave a tidyverse approach that I've settled on:

library(tidyverse)

df <- df %>%
  select(
    - {
      df %>%
        map_dbl(~ length(table(.x, useNA = "ifany"))) %>%
        {which(. == 1)} %>%
        names()
    }
  )

I think this could be made shorter but I'm too tired!

场罚期间 2025-01-01 16:55:27

好吧,为自己节省一些编码时间:

Rgames: foo
      [,1]  [,2] [,3]
 [1,]    1 1e+00    1
 [2,]    1 2e+00    1
 [3,]    1 3e+00    1
 [4,]    1 4e+00    1
 [5,]    1 5e+00    1
 [6,]    1 6e+00    2
 [7,]    1 7e+00    3
 [8,]    1 8e+00    1
 [9,]    1 9e+00    1
 [10,]    1 1e+01    1
Rgames: sd(foo)
[1] 0.000000e+00 3.027650e+00 6.749486e-01
Warning message:
sd(<matrix>) is deprecated.
 Use apply(*, 2, sd) instead.   

为了避免令人讨厌的浮点舍入,请采用该输出向量(我将其称为“bar”)并执行类似 bar[bar<; 2*.Machine$double.eps] <- 0 最后你的数据框 dat[,as.ological(bar)] 应该可以解决问题。

Well, save yourself some coding time:

Rgames: foo
      [,1]  [,2] [,3]
 [1,]    1 1e+00    1
 [2,]    1 2e+00    1
 [3,]    1 3e+00    1
 [4,]    1 4e+00    1
 [5,]    1 5e+00    1
 [6,]    1 6e+00    2
 [7,]    1 7e+00    3
 [8,]    1 8e+00    1
 [9,]    1 9e+00    1
 [10,]    1 1e+01    1
Rgames: sd(foo)
[1] 0.000000e+00 3.027650e+00 6.749486e-01
Warning message:
sd(<matrix>) is deprecated.
 Use apply(*, 2, sd) instead.   

To avoid nasty floating-point roundoffs, take that output vector, which I'll call "bar," and do something like bar[bar< 2*.Machine$double.eps] <- 0 and then finally your data frame dat[,as.logical(bar)] should do the trick.

情徒 2025-01-01 16:55:27

如何使用 factor 来计算唯一元素的数量并使用 sapply 循环:

dat[sapply(dat, function(x) length(levels(factor(x)))>1)]
   B  D F
1  3 10 I
2  4 10 J
3  6 10 I
4  9 10 J
5  2 10 I
6  9 10 J
7  9 10 I
8  7 10 J
9  6 10 I
10 1  1 J

默认情况下排除 NA,但可以使用 exclude 更改此设置> 因子参数:

dat[sapply(dat, function(x) length(levels(factor(x,exclude=NULL)))>1)]
   B  D F  G
1  3 10 I 10
2  4 10 J 10
3  6 10 I 10
4  9 10 J 10
5  2 10 I 10
6  9 10 J 10
7  9 10 I 10
8  7 10 J 10
9  6 10 I 10
10 1  1 J NA

How about using factor to count the number of unique elements and looping with sapply:

dat[sapply(dat, function(x) length(levels(factor(x)))>1)]
   B  D F
1  3 10 I
2  4 10 J
3  6 10 I
4  9 10 J
5  2 10 I
6  9 10 J
7  9 10 I
8  7 10 J
9  6 10 I
10 1  1 J

NAs are excluded by default, but this can be changed with the exclude parameter of factor:

dat[sapply(dat, function(x) length(levels(factor(x,exclude=NULL)))>1)]
   B  D F  G
1  3 10 I 10
2  4 10 J 10
3  6 10 I 10
4  9 10 J 10
5  2 10 I 10
6  9 10 J 10
7  9 10 I 10
8  7 10 J 10
9  6 10 I 10
10 1  1 J NA
爱已欠费 2025-01-01 16:55:27

我认为零方差相当于恒定,并且无需进行任何算术运算就可以解决问题。我希望 range() 优于 var(),但我还没有验证这一点:

removeConstantColumns <- function(a_dataframe, verbose=FALSE) {
  notConstant <- function(x) {
    if (is.factor(x)) x <- as.integer(x)
    return (0 != diff(range(x, na.rm=TRUE)))
  }
  bkeep <- sapply(a_dataframe, notConstant)
  if (verbose) {
    cat('removeConstantColumns: '
      , ifelse(all(bkeep)
        , 'nothing'
        , paste(names(a_dataframe)[!bkeep], collapse=',')
      , ' removed',  '\n')
  }
  return (a_dataframe[, bkeep])
}

I think having zero variance is equivalent to being constant and one can get around without doing any arithmetic operations at all. I would expect that range() outperforms var(), but I have not verified this:

removeConstantColumns <- function(a_dataframe, verbose=FALSE) {
  notConstant <- function(x) {
    if (is.factor(x)) x <- as.integer(x)
    return (0 != diff(range(x, na.rm=TRUE)))
  }
  bkeep <- sapply(a_dataframe, notConstant)
  if (verbose) {
    cat('removeConstantColumns: '
      , ifelse(all(bkeep)
        , 'nothing'
        , paste(names(a_dataframe)[!bkeep], collapse=',')
      , ' removed',  '\n')
  }
  return (a_dataframe[, bkeep])
}
旧情勿念 2025-01-01 16:55:27

检查这个自定义功能。我没有在具有 100 多个变量的数据帧上尝试它。

remove_low_variance_cols <- function(df, threshold = 0) {
  n <- Sys.time() #See how long this takes to run
  remove_cols <- df %>%
    select_if(is.numeric) %>%
    map_dfr(var) %>%
    gather() %>% 
    filter(value <= threshold) %>%
    spread(key, value) %>%
    names()

  if(length(remove_cols)) {
    print("Removing the following columns: ")
    print(remove_cols)
  }else {
    print("There are no low variance columns with this threshold")
  }
  #How long did this script take?
  print(paste("Time Consumed: ", Sys.time() - n, "Secs."))
  return(df[, setdiff(names(df), remove_cols)])
}

Check this custom function. I did not try it on data frames with 100+ variables.

remove_low_variance_cols <- function(df, threshold = 0) {
  n <- Sys.time() #See how long this takes to run
  remove_cols <- df %>%
    select_if(is.numeric) %>%
    map_dfr(var) %>%
    gather() %>% 
    filter(value <= threshold) %>%
    spread(key, value) %>%
    names()

  if(length(remove_cols)) {
    print("Removing the following columns: ")
    print(remove_cols)
  }else {
    print("There are no low variance columns with this threshold")
  }
  #How long did this script take?
  print(paste("Time Consumed: ", Sys.time() - n, "Secs."))
  return(df[, setdiff(names(df), remove_cols)])
}
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文